
Weight Sync

Weight sync transfers the current training model weights to one or more inference servers (e.g. SGLang) running on separate GPUs. This is the key mechanism for RL training loops where the policy model must be kept in sync between the training and inference sides.

Training GPUs (xorl server)
│ NCCL broadcast
Inference GPUs (SGLang)

xorl’s training server broadcasts weights to registered inference endpoints using NCCL. The broadcast group is initialized on the first sync call and reused for subsequent syncs.

[Diagram] Training GPUs (FSDP2 shards 0 .. N-1): rank 0 gathers each shard, dequantizes + merges LoRA (optional FP8), then NCCL-broadcasts in a bucket loop to the inference GPUs (SGLang TP replicas). Sequential per-module: FSDP all-gather → dequant/merge LoRA → optional FP8 → broadcast.

Before syncing, register the inference server’s address with xorl:

POST /api/v1/add_inference_endpoint
{"host": "inference-node-01", "port": 30000, "world_size": 8}

Multiple endpoints can be registered (e.g. for multi-replica inference):

for host, port in inference_servers:
    requests.post(f"{base_url}/api/v1/add_inference_endpoint", json={
        "host": host, "port": port, "world_size": 8,
    })

Endpoints are deduplicated by (host, port) — registering the same endpoint twice is safe.
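A minimal sketch of that dedup behavior (a hypothetical class, not xorl's actual registry) keys the stored endpoints on the (host, port) pair, so re-registration simply overwrites:

```python
class EndpointRegistry:
    """Illustrative sketch: endpoints deduplicated by (host, port)."""

    def __init__(self):
        self._endpoints = {}

    def add(self, host, port, world_size):
        # Re-registering the same (host, port) replaces the old entry.
        self._endpoints[(host, port)] = {
            "host": host, "port": port, "world_size": world_size,
        }

    def all(self):
        return list(self._endpoints.values())

registry = EndpointRegistry()
registry.add("inference-node-01", 30000, 8)
registry.add("inference-node-01", 30000, 8)  # duplicate: safely absorbed
```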

POST /api/v1/sync_inference_weights
POST /sync_inference_weights # shorthand alias
{
  "master_address": "training-node-01",
  "master_port": 29600,
  "group_name": "sync_group_0",
  "buffer_size_mb": 512
}
Field            Description
master_address   IP of the training head node (used for NCCL rendezvous)
master_port      Free port for the NCCL sync group initialization
group_name       Unique name for the NCCL process group (use different names for different inference clusters)
buffer_size_mb   NCCL broadcast buffer size per chunk
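A small helper for assembling that request body might look like this (the function name and defaults are illustrative, not part of xorl's API):

```python
def sync_request_body(master_address, master_port,
                      group_name="sync_group_0", buffer_size_mb=512):
    """Build the JSON body for POST /api/v1/sync_inference_weights."""
    return {
        "master_address": master_address,
        "master_port": master_port,
        "group_name": group_name,
        "buffer_size_mb": buffer_size_mb,
    }

body = sync_request_body("training-node-01", 29600)
# requests.post(f"{base_url}/api/v1/sync_inference_weights", json=body)
```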

The sync is synchronous from the caller’s perspective — the endpoint returns once all inference servers have received the weights.

# Every N RL steps, sync weights to inference
if rl_step % sync_every == 0:
    training_client.sync_weights(
        master_address=TRAINING_HEAD_IP,
        master_port=29600,
        group_name="policy_sync",
    )
    # Now SGLang is serving the latest policy weights

Optionally quantize weights during the sync to reduce transfer bandwidth and match inference precision:

POST /api/v1/set_sync_quantization
{
  "quantization": "fp8",  // "bf16" (no quant) or "fp8"
  "skip_modules": ["lm_head", "embed_tokens"]
}

With FP8 quantization, weights are quantized (BF16 → FP8 E4M3) on the training side before broadcasting, then dequantized on the inference side. This halves the sync bandwidth but introduces a small precision loss.
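The round trip can be emulated in NumPy to get a feel for the precision loss. This sketch rounds to a 3-bit mantissa and clips at E4M3's maximum normal value of 448, assuming per-tensor scaling; subnormal and NaN handling are omitted, and none of these function names come from xorl:

```python
import numpy as np

def e4m3_round(x):
    # Round to the nearest value representable with a 3-bit mantissa,
    # then clip to E4M3's max normal value of 448.
    m, e = np.frexp(x)         # x = m * 2**e, with 0.5 <= |m| < 1
    m = np.round(m * 16) / 16  # keep 1 implicit + 3 explicit mantissa bits
    return np.clip(np.ldexp(m, e), -448.0, 448.0)

def quantize_fp8(w):
    scale = np.abs(w).max() / 448.0  # per-tensor scale: max maps to 448
    return e4m3_round(w / scale), scale

def dequantize_fp8(q, scale):
    return q * scale
```

With 3 mantissa bits the worst-case relative rounding error is 2^-4 (about 6%), which is the "small precision loss" referred to above.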

For LoRA training, the sync merges LoRA weights into the base model before broadcasting:

  • W_full = W_base + (lora_B @ lora_A) * scaling
  • The merged BF16 weight is sent to inference

The base weights on the training side are not modified — LoRA parameters remain separate for continued training.
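In NumPy, the merge reduces to one matmul and an add; the shapes and scaling factor below are illustrative, not xorl defaults:

```python
import numpy as np

d_out, d_in, rank = 64, 32, 8
scaling = 2.0  # typically lora_alpha / rank

W_base = np.random.randn(d_out, d_in).astype(np.float32)
lora_A = np.random.randn(rank, d_in).astype(np.float32)
lora_B = np.random.randn(d_out, rank).astype(np.float32)

# The merged result is a new array for transfer; W_base itself stays
# untouched so LoRA training can continue with separate adapters.
W_full = W_base + (lora_B @ lora_A) * scaling
```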

For QLoRA, the sync dequantizes and merges:

  • W_full = dequant(W_packed) + correction_U @ correction_B + (lora_B @ lora_A) * scaling
  • Optionally re-quantizes to FP8 for transfer
To deregister an inference server, remove its endpoint:

POST /api/v1/remove_inference_endpoint
{"host": "inference-node-01", "port": 30000}

When the training server needs all GPU memory (e.g. for a large training step), put inference servers to sleep:

# Free inference GPU memory
requests.post(f"{inference_url}/sleep")

# Do training
for _ in range(n_steps):
    training_client.forward_backward(...)
    training_client.optim_step(...)

# Sync and resume inference
training_client.sync_weights(...)
requests.post(f"{inference_url}/wake_up")

The sync handler (server/weight_sync/handler.py) proceeds in order:

  1. Health check: Verify all inference endpoints are reachable
  2. Backend init: Initialize NCCL broadcast group between training + inference ranks
  3. Per-module transfer (sequential across PP stages if PP > 1):
    • FSDP2 all-gather to reconstruct full parameter on rank 0
    • QLoRA dequantize + LoRA merge
    • Optional FP8 requantization
    • NCCL broadcast to all inference ranks
  4. Resume inference: Signal inference servers to resume
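The ordering above can be sketched as a simple loop. Every helper here is an illustrative stand-in (each one just records its name so the sequence is visible), not the actual handler.py API:

```python
log = []

def health_check(endpoints):
    # 1. Every inference endpoint must answer before NCCL init.
    log.append("health_check")

def init_broadcast_group():
    # 2. One-time NCCL group setup between training + inference ranks.
    log.append("backend_init")

def transfer_module(name):
    # 3. Per module: all-gather -> dequant/merge -> optional FP8 -> broadcast.
    log.append(f"transfer:{name}")

def resume_inference():
    # 4. Tell inference servers to resume serving.
    log.append("resume")

def run_sync(endpoints, modules):
    health_check(endpoints)
    init_broadcast_group()
    for m in modules:  # sequential per-module loop
        transfer_module(m)
    resume_inference()

run_sync(["infer-0"], ["embed", "layer_0", "lm_head"])
```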

For PP-sharded training, each PP stage’s parameters are synced independently in sequence. All PP ranks that hold a given layer participate in the broadcast for that layer.

Expert-parallel models: EP ranks send their expert shards directly without gathering to rank 0 first (supports_direct_ep_transfer: true).

Config                                       Sync time   Transfer size
BF16, 64 training GPUs → 8 inference GPUs    ~73 s       ~471 GB
FP8, same topology                           ~40 s       ~236 GB

With NVLink: ~22 GB/s effective bandwidth per sync.

File                                      Description
src/xorl/server/weight_sync/handler.py    WeightSyncHandler — orchestrates the full sync pipeline
src/xorl/server/weight_sync/backends/     Transport backend implementations