Backend: nccl_broadcast

nccl_broadcast is the default (and currently only) weight transport backend. It uses a dedicated NCCL process group to broadcast model weights from training rank 0 to all inference TP workers.

```
Training GPUs
rank 0 ──── NCCL broadcast ────► inference TP rank 1
                              ├─► inference TP rank 2
                              └─► inference TP rank N
rank 1..K (idle during transfer — only participate in FSDP/EP collectives)
```

Only training rank 0 participates in the NCCL sync group. All other training ranks are involved only in the per-parameter FSDP unshard that reconstructs the full weight on rank 0 before it gets broadcast.

For each sync call:

  1. NCCL group init — training rank 0 creates a TCPStore at master_address:master_port; inference endpoints join via HTTP (/init_weights_update_group), using rank_offset = 1 + sum(prior endpoint world_sizes). The group is torn down after the sync completes.

  2. Bucket loop — parameters are batched into buffer_size_mb-sized buckets and transferred sequentially:

    • HTTP calls to /update_weights_from_distributed on all endpoints are fired in parallel background threads.
    • dist.broadcast(tensor, src=0, group=sync_group) is called for each tensor. This call blocks until every inference rank has posted its recv, so no external synchronization is needed.
    • The last bucket carries flush_cache=True to signal the inference server to drop its KV cache.
  3. NCCL group destroy — dist.destroy_process_group on the training side; /destroy_weights_update_group on inference.
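The rank layout and bucketing from the steps above can be sketched in plain Python. The helper names below are illustrative, not the real xorl API:

```python
def endpoint_rank_offsets(world_sizes):
    """Assign each inference endpoint its starting rank in the sync group.

    Training rank 0 is sync rank 0; endpoint i joins at
    rank_offset = 1 + sum(prior endpoint world_sizes).
    """
    offsets, next_rank = [], 1
    for ws in world_sizes:
        offsets.append(next_rank)
        next_rank += ws
    return offsets


def make_buckets(params, buffer_size_mb):
    """Greedily pack (name, num_bytes) pairs into buckets of at most
    buffer_size_mb megabytes; an oversized tensor gets its own bucket."""
    limit = buffer_size_mb * 1024 * 1024
    buckets, current, current_bytes = [], [], 0
    for name, nbytes in params:
        if current and current_bytes + nbytes > limit:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets


# Two TP-4 endpoints join the sync group at ranks 1 and 5.
print(endpoint_rank_offsets([4, 4]))  # → [1, 5]

# Two 600 MiB tensors don't fit in one 1024 MiB bucket; the 100 MiB one does.
params = [("a", 600 << 20), ("b", 600 << 20), ("c", 100 << 20)]
print(make_buckets(params, 1024))  # → [['a'], ['b', 'c']]
```

In the real backend each bucket is then broadcast with dist.broadcast before the next one is packed; only the last bucket carries flush_cache=True.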

Flow summary: TCPStore rendezvous → bucket loop (per buffer_size_mb): FSDP unshard (all-gather to rank 0; dequant + merge LoRA / optional FP8) → dist.broadcast (src=0, group=sync; blocks until all recv), repeated for each bucket → destroy group (destroy_process_group / /destroy_weights_update_group). The last bucket carries flush_cache=True, so inference drops its KV cache.

Parameters come from the SyncWeightsData request body (see overview):

| Field | Default | Description |
| --- | --- | --- |
| `master_address` | `"localhost"` | IP of the training head node used for TCPStore rendezvous |
| `master_port` | `29600` | Free port for the TCPStore; must not be in use during sync |
| `group_name` | `"weight_sync_group"` | Unique NCCL group name; use different names for parallel sync groups |
| `buffer_size_mb` | `1024` | Bucket size in MB; larger = fewer round-trips, more peak GPU memory |
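Spelled out, a request body using all defaults would look like this (field names and defaults are from the table above; this is a sketch, not a schema dump of SyncWeightsData):

```python
# Hypothetical SyncWeightsData request body with the documented defaults.
sync_weights_data = {
    "master_address": "localhost",      # training head node for TCPStore rendezvous
    "master_port": 29600,               # must be free for the duration of the sync
    "group_name": "weight_sync_group",  # distinct names allow parallel sync groups
    "buffer_size_mb": 1024,             # bucket size: larger = fewer round-trips
}
print(sync_weights_data["group_name"])  # → weight_sync_group
```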

The backend temporarily unsets TORCHELASTIC_USE_AGENT_STORE during group init. This is required because torchrun sets it to True, which forces all ranks to be TCPStore clients. Since rank 0 must be the store master for the weight-sync group, the variable is cleared for the duration of _init_training_process_group and then restored.
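The save/clear/restore dance can be expressed as a small context manager (a sketch of the idea; the real backend does this inline around `_init_training_process_group`):

```python
import os
from contextlib import contextmanager


@contextmanager
def env_unset(name):
    """Temporarily remove `name` from the environment, restoring it on exit."""
    saved = os.environ.pop(name, None)
    try:
        yield
    finally:
        if saved is not None:
            os.environ[name] = saved


# torchrun sets this to force all ranks to be TCPStore clients; it must be
# cleared so training rank 0 can act as store master for the weight-sync group.
os.environ["TORCHELASTIC_USE_AGENT_STORE"] = "True"
with env_unset("TORCHELASTIC_USE_AGENT_STORE"):
    assert "TORCHELASTIC_USE_AGENT_STORE" not in os.environ  # rank 0 hosts the store here
assert os.environ["TORCHELASTIC_USE_AGENT_STORE"] == "True"  # restored afterwards
```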

NCCL_CUMEM_ENABLE=0 is also set to avoid conflicts with the separate NCCL communicator.

| Limitation | Notes |
| --- | --- |
| Only rank 0 sends | All parameters are gathered to rank 0 before broadcast. EP experts and PP stage params all route through rank 0. |
| No direct EP transfer | `supports_direct_ep_transfer = False`; EP shards are gathered first. |
| No direct PP transfer | `supports_direct_pp_transfer = False`; PP followers ship CPU buffers to rank 0. |
| Sequential buckets | Buckets are sent one at a time; no pipelining across buckets. |

Future backends can lift these restrictions by implementing the WeightTransportBackend interface with supports_direct_ep_transfer = True or sender_ranks returning multiple ranks.

nccl_broadcast implements WeightTransportBackend from src/xorl/server/weight_sync/backends/base.py:

```python
backend = create_backend("nccl_broadcast", config)
ok = backend.initialize()           # TCPStore rendezvous, NCCL group init
backend.transfer_bucket(bucket,     # list of (name, tensor) pairs
                        src_rank=0,
                        flush_cache=False)
backend.destroy()                   # destroy NCCL group on both sides
```

sender_ranks returns frozenset({0}) — the handler only extracts and prepares weight buffers on rank 0.

To add a new backend (e.g. shared_storage or ep_direct):

  1. Create src/xorl/server/weight_sync/backends/<name>.py implementing WeightTransportBackend.
  2. Register it in backends/__init__.py:
    ```python
    def create_backend(method: str, config: TransportConfig, **kwargs):
        if method == "nccl_broadcast":
            ...
        if method == "<name>":
            from .<name> import MyBackend
            return MyBackend(config, **kwargs)
    ```
  3. Add the new literal to sync_inference_method in server_arguments.py.
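A minimal skeleton for step 1 might look like the following. The method names mirror the usage shown earlier (`initialize`, `transfer_bucket`, `destroy`, `sender_ranks`); treat the exact signatures as assumptions and check `backends/base.py`, and note the in-memory "store" is a stand-in for real shared storage:

```python
class SharedStorageBackend:  # would subclass WeightTransportBackend in practice
    """Illustrative backend that 'transfers' weights through a shared dict."""

    supports_direct_ep_transfer = False
    supports_direct_pp_transfer = False

    def __init__(self, config, **kwargs):
        self.config = config
        self.store = {}  # stand-in for a shared filesystem or object store

    @property
    def sender_ranks(self):
        return frozenset({0})  # only rank 0 prepares and sends weight buffers

    def initialize(self):
        return True  # real backends do rendezvous / group init here

    def transfer_bucket(self, bucket, src_rank=0, flush_cache=False):
        for name, tensor in bucket:
            self.store[name] = tensor  # real backends write to shared storage
        if flush_cache:
            self.store["__flush_cache__"] = True  # signal KV-cache drop

    def destroy(self):
        self.store.clear()
```

With the registration from step 2, `create_backend("shared_storage", config)` would then return an instance of this class.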
| File | Description |
| --- | --- |
| `src/xorl/server/weight_sync/backends/nccl_broadcast.py` | `NcclBroadcastBackend` — full bucket loop and group lifecycle |