Backend: nccl_broadcast

nccl_broadcast is the default (and currently only) weight transport backend. It uses a dedicated NCCL process group to broadcast model weights from training rank 0 to all inference TP workers.

```
Training GPUs
rank 0 ──── NCCL broadcast ────► inference TP rank 1
                              ├─► inference TP rank 2
                              └─► inference TP rank N
rank 1..K (idle during transfer — only participate in FSDP/EP collectives)
```

Only training rank 0 participates in the NCCL sync group. All other training ranks are involved only in the per-parameter FSDP unshard that reconstructs the full weight on rank 0 before it gets broadcast.

For each sync call:

  1. NCCL group init — training rank 0 creates a TCPStore at master_address:master_port; inference endpoints join via HTTP (/init_weights_update_group), using rank_offset = 1 + sum(prior endpoint world_sizes). The group is torn down after the sync completes.

  2. Bucket loop — parameters are batched into buffer_size_mb-sized buckets and transferred sequentially:

    • HTTP calls to /update_weights_from_distributed on all endpoints are fired in parallel background threads.
    • dist.broadcast(tensor, src=0, group=sync_group) is called for each tensor. This call blocks until every inference rank has posted its recv, so no external synchronization is needed.
    • The last bucket carries flush_cache=True to signal the inference server to drop its KV cache.
  3. NCCL group destroy — dist.destroy_process_group on the training side; /destroy_weights_update_group on inference.
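The rank layout and bucketing from the steps above can be sketched in plain Python. The helper names below are illustrative, not the real xorl API:

```python
def endpoint_rank_offsets(world_sizes):
    """Assign each inference endpoint its starting rank in the sync group.

    Training rank 0 is sync rank 0; endpoint i joins at
    rank_offset = 1 + sum(prior endpoint world_sizes).
    """
    offsets, next_rank = [], 1
    for ws in world_sizes:
        offsets.append(next_rank)
        next_rank += ws
    return offsets


def make_buckets(params, buffer_size_mb):
    """Greedily pack (name, num_bytes) pairs into buckets of at most
    buffer_size_mb megabytes; an oversized tensor gets its own bucket."""
    limit = buffer_size_mb * 1024 * 1024
    buckets, current, current_bytes = [], [], 0
    for name, nbytes in params:
        if current and current_bytes + nbytes > limit:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets


# Two TP-4 endpoints join the sync group at ranks 1 and 5.
print(endpoint_rank_offsets([4, 4]))  # → [1, 5]

# Two 600 MiB tensors don't fit in one 1024 MiB bucket; the 100 MiB one does.
params = [("a", 600 << 20), ("b", 600 << 20), ("c", 100 << 20)]
print(make_buckets(params, 1024))  # → [['a'], ['b', 'c']]
```

In the real backend each bucket is then broadcast with dist.broadcast before the next one is packed; only the last bucket carries flush_cache=True.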

Flow summary: TCPStore rendezvous → bucket loop (per buffer_size_mb): FSDP unshard (all-gather to rank 0; dequant + merge LoRA / optional FP8) → dist.broadcast (src=0, group=sync; blocks until all recv), repeated for each bucket → destroy group (destroy_process_group / /destroy_weights_update_group). The last bucket carries flush_cache=True, so inference drops its KV cache.

Parameters come from the SyncWeightsData request body (see overview):

| Field | Default | Description |
| --- | --- | --- |
| `master_address` | `"localhost"` | IP of the training head node used for TCPStore rendezvous |
| `master_port` | `29600` | Free port for the TCPStore; must not be in use during sync |
| `group_name` | `"weight_sync_group"` | Unique NCCL group name; use different names for parallel sync groups |
| `buffer_size_mb` | `1024` | Bucket size in MB; larger = fewer round-trips, more peak GPU memory |
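Spelled out, a request body using all defaults would look like this (field names and defaults are from the table above; this is a sketch, not a schema dump of SyncWeightsData):

```python
# Hypothetical SyncWeightsData request body with the documented defaults.
sync_weights_data = {
    "master_address": "localhost",      # training head node for TCPStore rendezvous
    "master_port": 29600,               # must be free for the duration of the sync
    "group_name": "weight_sync_group",  # distinct names allow parallel sync groups
    "buffer_size_mb": 1024,             # bucket size: larger = fewer round-trips
}
print(sync_weights_data["group_name"])  # → weight_sync_group
```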

The backend temporarily unsets TORCHELASTIC_USE_AGENT_STORE during group init. This is required because torchrun sets it to True, which forces all ranks to be TCPStore clients. Since rank 0 must be the store master for the weight-sync group, the variable is cleared for the duration of _init_training_process_group and then restored.
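The save/clear/restore dance can be expressed as a small context manager (a sketch of the idea; the real backend does this inline around `_init_training_process_group`):

```python
import os
from contextlib import contextmanager


@contextmanager
def env_unset(name):
    """Temporarily remove `name` from the environment, restoring it on exit."""
    saved = os.environ.pop(name, None)
    try:
        yield
    finally:
        if saved is not None:
            os.environ[name] = saved


# torchrun sets this to force all ranks to be TCPStore clients; it must be
# cleared so training rank 0 can act as store master for the weight-sync group.
os.environ["TORCHELASTIC_USE_AGENT_STORE"] = "True"
with env_unset("TORCHELASTIC_USE_AGENT_STORE"):
    assert "TORCHELASTIC_USE_AGENT_STORE" not in os.environ  # rank 0 hosts the store here
assert os.environ["TORCHELASTIC_USE_AGENT_STORE"] == "True"  # restored afterwards
```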

NCCL_CUMEM_ENABLE=0 is also set to avoid conflicts with the separate NCCL communicator.

| Limitation | Notes |
| --- | --- |
| Only rank 0 sends | All parameters are gathered to rank 0 before broadcast. EP experts and PP stage params all route through rank 0. |
| No direct EP transfer | `supports_direct_ep_transfer = False`; EP shards are gathered first. |
| No direct PP transfer | `supports_direct_pp_transfer = False`; PP followers ship CPU buffers to rank 0. |
| Sequential buckets | Buckets are sent one at a time; no pipelining across buckets. |

Future backends can lift these restrictions by implementing the WeightTransportBackend interface with supports_direct_ep_transfer = True or sender_ranks returning multiple ranks.

nccl_broadcast implements WeightTransportBackend from src/xorl/server/weight_sync/backends/base.py:

```python
backend = create_backend("nccl_broadcast", config)
ok = backend.initialize()           # TCPStore rendezvous, NCCL group init
backend.transfer_bucket(bucket,     # list of (name, tensor) pairs
                        src_rank=0,
                        flush_cache=False)
backend.destroy()                   # destroy NCCL group on both sides
```

sender_ranks returns frozenset({0}) — the handler only extracts and prepares weight buffers on rank 0.

To add a new backend (e.g. shared_storage or ep_direct):

  1. Create src/xorl/server/weight_sync/backends/<name>.py implementing WeightTransportBackend.
  2. Register it in backends/__init__.py:
    ```python
    def create_backend(method: str, config: TransportConfig, **kwargs):
        if method == "nccl_broadcast":
            ...
        if method == "<name>":
            from .<name> import MyBackend
            return MyBackend(config, **kwargs)
    ```
  3. Add the new literal to sync_inference_method in server_arguments.py.
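A minimal skeleton for step 1 might look like the following. The method names mirror the usage shown earlier (`initialize`, `transfer_bucket`, `destroy`, `sender_ranks`); treat the exact signatures as assumptions and check `backends/base.py`, and note the in-memory "store" is a stand-in for real shared storage:

```python
class SharedStorageBackend:  # would subclass WeightTransportBackend in practice
    """Illustrative backend that 'transfers' weights through a shared dict."""

    supports_direct_ep_transfer = False
    supports_direct_pp_transfer = False

    def __init__(self, config, **kwargs):
        self.config = config
        self.store = {}  # stand-in for a shared filesystem or object store

    @property
    def sender_ranks(self):
        return frozenset({0})  # only rank 0 prepares and sends weight buffers

    def initialize(self):
        return True  # real backends do rendezvous / group init here

    def transfer_bucket(self, bucket, src_rank=0, flush_cache=False):
        for name, tensor in bucket:
            self.store[name] = tensor  # real backends write to shared storage
        if flush_cache:
            self.store["__flush_cache__"] = True  # signal KV-cache drop

    def destroy(self):
        self.store.clear()
```

With the registration from step 2, `create_backend("shared_storage", config)` would then return an instance of this class.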
| File | Description |
| --- | --- |
| `src/xorl/server/weight_sync/backends/nccl_broadcast.py` | `NcclBroadcastBackend` — full bucket loop and group lifecycle |