# Backend: nccl_broadcast

`nccl_broadcast` is the default (and currently only) weight-transport backend. It uses a dedicated NCCL process group to broadcast model weights from training rank 0 to all inference TP workers.
## Architecture

```
Training GPUs
  rank 0 ──── NCCL broadcast ────► inference TP rank 1
                                ├─► inference TP rank 2
                                └─► inference TP rank N
  rank 1..K (idle during transfer — only participate in FSDP/EP collectives)
```

Only training rank 0 participates in the NCCL sync group. All other training ranks are involved only in the per-parameter FSDP unshard that reconstructs the full weight on rank 0 before it is broadcast.
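The rank layout implied by this diagram can be sketched with a small helper. This function is hypothetical (not from the codebase); it only applies the `rank_offset = 1 + sum(prior endpoint world_sizes)` formula described under Transfer Flow:

```python
# Hypothetical helper: compute each participant's global rank in the
# weight-sync group. Training rank 0 is always global rank 0; each
# inference endpoint starts at 1 + sum of the prior endpoints' world sizes.
def sync_group_ranks(endpoint_world_sizes):
    layout = {"training_rank_0": 0}
    offset = 1  # rank_offset for the first endpoint
    for i, world_size in enumerate(endpoint_world_sizes):
        layout[f"endpoint_{i}"] = list(range(offset, offset + world_size))
        offset += world_size
    return layout

# Two inference endpoints with TP=2 and TP=4:
print(sync_group_ranks([2, 4]))
# {'training_rank_0': 0, 'endpoint_0': [1, 2], 'endpoint_1': [3, 4, 5, 6]}
```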
## Transfer Flow

For each sync call:

1. NCCL group init — training rank 0 creates a `TCPStore` at `master_address:master_port`; inference endpoints join via HTTP (`/init_weights_update_group`), using `rank_offset = 1 + sum(prior endpoint world_sizes)`. The group is torn down after the sync completes.
2. Bucket loop — parameters are batched into `buffer_size_mb`-sized buckets and transferred sequentially:
   - HTTP calls to `/update_weights_from_distributed` on all endpoints are fired in parallel background threads.
   - `dist.broadcast(tensor, src=0, group=sync_group)` is called for each tensor. This call blocks until every inference rank has posted its recv, so no external synchronization is needed.
   - The last bucket carries `flush_cache=True` to signal the inference server to drop its KV cache.
3. NCCL group destroy — `dist.destroy_process_group` on the training side; `/destroy_weights_update_group` on inference.
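The bucketing in step 2 can be sketched as a greedy size-based grouping. This is a hypothetical helper, not the code from `nccl_broadcast.py`; it only models the size logic, leaving out the HTTP threads and the actual broadcast:

```python
# Hypothetical sketch of buffer_size_mb bucketing: greedily pack
# (name, numel, itemsize) triples into buckets of at most
# buffer_size_mb megabytes; an oversized tensor still gets its own bucket.
def bucket_params(named_params, buffer_size_mb=1024):
    limit = buffer_size_mb * 1024 * 1024  # bucket budget in bytes
    buckets, current, current_bytes = [], [], 0
    for name, numel, itemsize in named_params:
        nbytes = numel * itemsize
        if current and current_bytes + nbytes > limit:
            buckets.append(current)          # close the full bucket
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)              # flush the final bucket
    return buckets

# Three bf16 tensors (2 bytes/element): 600 MB + 400 MB fit in one
# 1024 MB bucket; the next 200 MB tensor starts a second bucket.
params = [("embed", 300_000_000, 2), ("w1", 200_000_000, 2), ("w2", 100_000_000, 2)]
print(bucket_params(params, buffer_size_mb=1024))
# [['embed', 'w1'], ['w2']]
```

In the real flow, the last bucket returned here would be the one sent with `flush_cache=True`.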
## Configuration

Parameters come from the `SyncWeightsData` request body (see overview):
| Field | Default | Description |
|---|---|---|
| `master_address` | `"localhost"` | IP of the training head node used for TCPStore rendezvous |
| `master_port` | `29600` | Free port for the TCPStore; must not be in use during sync |
| `group_name` | `"weight_sync_group"` | Unique NCCL group name; use different names for parallel sync groups |
| `buffer_size_mb` | `1024` | Bucket size in MB; larger = fewer round-trips, more peak GPU memory |
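Put together, a `SyncWeightsData` request body might look like this (the address and port values are illustrative, not defaults from the codebase):

```json
{
  "master_address": "10.0.0.1",
  "master_port": 29600,
  "group_name": "weight_sync_group",
  "buffer_size_mb": 1024
}
```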
## Environment Notes

The backend temporarily unsets `TORCHELASTIC_USE_AGENT_STORE` during group init. This is required because `torchrun` sets it to `True`, which forces all ranks to be TCPStore clients. Since rank 0 must be the store master for the weight-sync group, the variable is cleared for the duration of `_init_training_process_group` and then restored.

`NCCL_CUMEM_ENABLE=0` is also set to avoid conflicts with the separate NCCL communicator.
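The clear-and-restore pattern described above has the general shape below. This is a sketch of the pattern only, not the actual `_init_training_process_group` code:

```python
import os
from contextlib import contextmanager

# Sketch of the unset/restore pattern: remove an environment variable
# for the duration of a block, then put the original value back.
@contextmanager
def env_var_cleared(name):
    saved = os.environ.pop(name, None)  # remember the old value, if any
    try:
        yield
    finally:
        if saved is not None:
            os.environ[name] = saved    # restore on exit, even on error

os.environ["TORCHELASTIC_USE_AGENT_STORE"] = "True"
with env_var_cleared("TORCHELASTIC_USE_AGENT_STORE"):
    # Inside the block, rank 0 can act as the TCPStore master.
    assert "TORCHELASTIC_USE_AGENT_STORE" not in os.environ
assert os.environ["TORCHELASTIC_USE_AGENT_STORE"] == "True"
```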
## Limitations

| Limitation | Notes |
|---|---|
| Only rank 0 sends | All parameters are gathered to rank 0 before broadcast. EP experts and PP stage params all route through rank 0. |
| No direct EP transfer | `supports_direct_ep_transfer = False`; EP shards are gathered first. |
| No direct PP transfer | `supports_direct_pp_transfer = False`; PP followers ship CPU buffers to rank 0. |
| Sequential buckets | Buckets are sent one at a time; no pipelining across buckets. |
Future backends can lift these restrictions by implementing the `WeightTransportBackend` interface with `supports_direct_ep_transfer = True` or `sender_ranks` returning multiple ranks.
## Backend Interface

`nccl_broadcast` implements `WeightTransportBackend` from `src/xorl/server/weight_sync/backends/base.py`:

```python
backend = create_backend("nccl_broadcast", config)

ok = backend.initialize()          # TCPStore rendezvous, NCCL group init

backend.transfer_bucket(bucket,    # list of (name, tensor) pairs
                        src_rank=0,
                        flush_cache=False)

backend.destroy()                  # destroy NCCL group on both sides
```

`sender_ranks` returns `frozenset({0})` — the handler only extracts and prepares weight buffers on rank 0.
## Extending with a New Backend

To add a new backend (e.g. `shared_storage` or `ep_direct`):

1. Create `src/xorl/server/weight_sync/backends/<name>.py` implementing `WeightTransportBackend`.
2. Register it in `backends/__init__.py`:

   ```python
   def create_backend(method: str, config: TransportConfig, **kwargs):
       if method == "nccl_broadcast":
           ...
       if method == "<name>":
           from .<name> import MyBackend
           return MyBackend(config, **kwargs)
   ```

3. Add the new literal to `sync_inference_method` in `server_arguments.py`.
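A new backend can start from a stub like the one below. This is hypothetical: the actual abstract base in `backends/base.py` may declare additional methods, and only the calls shown in the Backend Interface section above are assumed here:

```python
# Hypothetical stub backend exercising the interface described above.
# Instead of moving data, it records which parameter names were "sent".
class LoggingBackend:
    supports_direct_ep_transfer = False
    supports_direct_pp_transfer = False

    def __init__(self, config, **kwargs):
        self.config = config
        self.transferred = []          # one entry per bucket

    def initialize(self) -> bool:
        # A real backend would rendezvous here (e.g. TCPStore + NCCL init).
        return True

    def transfer_bucket(self, bucket, src_rank=0, flush_cache=False):
        # Record the names from the (name, tensor) pairs.
        self.transferred.append([name for name, _ in bucket])

    def destroy(self):
        # A real backend would tear down its process group here.
        pass

    @property
    def sender_ranks(self):
        return frozenset({0})

backend = LoggingBackend(config={})
assert backend.initialize()
backend.transfer_bucket([("embed", None), ("lm_head", None)], src_rank=0)
backend.destroy()
print(backend.transferred)  # [['embed', 'lm_head']]
```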
## Source

| File | Description |
|---|---|
| `src/xorl/server/weight_sync/backends/nccl_broadcast.py` | `NcclBroadcastBackend` — full bucket loop and group lifecycle |