# Weight Sync
Weight sync transfers the current training model weights to one or more inference servers (e.g. SGLang) running on separate GPUs. This is the key mechanism for RL training loops where the policy model must be kept in sync between the training and inference sides.
## Overview

```
Training GPUs (xorl server)
        │  NCCL broadcast
        ▼
Inference GPUs (SGLang)
```

xorl’s training server broadcasts weights to registered inference endpoints using NCCL. The broadcast group is initialized on the first sync call and reused for subsequent syncs.
## Registering Inference Endpoints

Before syncing, register the inference server’s address with xorl:
```
POST /api/v1/add_inference_endpoint
{"host": "inference-node-01", "port": 30000, "world_size": 8}
```

Multiple endpoints can be registered (e.g. for multi-replica inference):
```python
for host, port in inference_servers:
    requests.post(f"{base_url}/api/v1/add_inference_endpoint", json={
        "host": host,
        "port": port,
        "world_size": 8,
    })
```

Endpoints are deduplicated by `(host, port)` — registering the same endpoint twice is safe.
## Triggering a Sync

```
POST /api/v1/sync_inference_weights
POST /sync_inference_weights   # shorthand alias
```

```json
{
  "master_address": "training-node-01",
  "master_port": 29600,
  "group_name": "sync_group_0",
  "buffer_size_mb": 512
}
```

| Field | Description |
|---|---|
| `master_address` | IP of the training head node (used for NCCL rendezvous) |
| `master_port` | Free port for NCCL sync group initialization |
| `group_name` | Unique name for the NCCL process group (use different names for different inference clusters) |
| `buffer_size_mb` | NCCL broadcast buffer size per chunk, in MB |
The sync is synchronous from the caller’s perspective — the endpoint returns once all inference servers have received the weights.
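If `buffer_size_mb` caps how much weight data is staged per broadcast, parameters have to be grouped into chunks that fit the buffer. A minimal, dependency-free sketch of such size-based chunking — the function name and the greedy policy are illustrative assumptions, not xorl’s actual code:

```python
def chunk_tensors(sizes, buffer_size_mb=512):
    # Greedily group tensor indices into chunks whose total byte size
    # stays under the broadcast buffer. `sizes` is a list of tensor
    # sizes in bytes, in transfer order.
    limit = buffer_size_mb * 1024 * 1024
    chunks, current, used = [], [], 0
    for i, size in enumerate(sizes):
        if current and used + size > limit:
            chunks.append(current)
            current, used = [], 0
        current.append(i)
        used += size
    if current:
        chunks.append(current)
    return chunks
```

A tensor larger than the buffer still forms its own (oversized) chunk here; a real implementation would split it further.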
## Sync in RL Training Loop

```python
# Every N RL steps, sync weights to inference
if rl_step % sync_every == 0:
    training_client.sync_weights(
        master_address=TRAINING_HEAD_IP,
        master_port=29600,
        group_name="policy_sync",
    )
    # Now SGLang is serving the latest policy weights
```

## Quantization During Sync

Optionally quantize weights during the sync to reduce transfer bandwidth and match inference precision:

```
POST /api/v1/set_sync_quantization
{
  "quantization": "fp8",                       // "bf16" (no quant) or "fp8"
  "skip_modules": ["lm_head", "embed_tokens"]
}
```

With FP8 quantization, weights are quantized (BF16 → FP8 E4M3) on the training side before broadcasting, then dequantized on the inference side. This halves the sync bandwidth but introduces a small precision loss.
## Sync with LoRA

For LoRA training, the sync merges LoRA weights into the base model before broadcasting:

```
W_full = W_base + (lora_B @ lora_A) * scaling
```

- The merged BF16 weight is sent to inference

The base weights on the training side are not modified — LoRA parameters remain separate for continued training.
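The merge formula can be sketched with plain Python lists — a naive matmul stands in for the GPU kernel, and `w_base` is only read, never written, matching the “base weights are not modified” guarantee:

```python
def matmul(a, b):
    # Naive matrix multiply, sufficient for this illustration.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(w_base, lora_a, lora_b, scaling):
    # W_full = W_base + (lora_B @ lora_A) * scaling
    # w_base is only read; the training-side copy stays unmodified.
    delta = matmul(lora_b, lora_a)
    return [[w_base[i][j] + scaling * delta[i][j]
             for j in range(len(w_base[0]))] for i in range(len(w_base))]
```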
## Sync with QLoRA

For QLoRA, the sync dequantizes and merges:

```
W_full = dequant(W_packed) + correction_U @ correction_B + (lora_B @ lora_A) * scaling
```

- Optionally re-quantizes to FP8 for transfer
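The same sketch style for the QLoRA formula, with a caller-supplied `dequant` standing in for unpacking the quantized storage format (all names are illustrative):

```python
def matmul(a, b):
    # Naive matrix multiply for small illustration matrices.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def madd(x, y, scale=1.0):
    # Elementwise x + scale * y.
    return [[x[i][j] + scale * y[i][j] for j in range(len(x[0]))]
            for i in range(len(x))]

def merge_qlora(w_packed, dequant, corr_u, corr_b, lora_a, lora_b, scaling):
    # W_full = dequant(W_packed) + corr_U @ corr_B + (lora_B @ lora_A) * scaling
    w = dequant(w_packed)
    w = madd(w, matmul(corr_u, corr_b))
    return madd(w, matmul(lora_b, lora_a), scaling)
```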
## Removing Inference Endpoints

```
POST /api/v1/remove_inference_endpoint
{"host": "inference-node-01", "port": 30000}
```

## Sleep / Wake Pattern

When the training server needs all GPU memory (e.g. for a large training step), put inference servers to sleep:

```python
# Free inference GPU memory
requests.post(f"{inference_url}/sleep")

# Do training
for _ in range(n_steps):
    training_client.forward_backward(...)
    training_client.optim_step(...)

# Sync and resume inference
training_client.sync_weights(...)
requests.post(f"{inference_url}/wake_up")
```
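One way to make the wake-up unconditional is a context manager around this sleep/wake pattern — a sketch, not part of xorl; `post` is injected so it works with `requests.post` or any callable with the same shape:

```python
from contextlib import contextmanager

@contextmanager
def inference_asleep(post, inference_url):
    # Free inference GPU memory on entry; wake the server back up on
    # exit, even if the training block in between raises.
    post(f"{inference_url}/sleep")
    try:
        yield
    finally:
        post(f"{inference_url}/wake_up")
```

The training steps and the final `sync_weights` call then go inside the `with` block, and the server is woken up exactly once on the way out.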
## Architecture Details

The sync handler (`server/weight_sync/handler.py`) proceeds in order:
1. **Health check**: verify all inference endpoints are reachable
2. **Backend init**: initialize the NCCL broadcast group between training and inference ranks
3. **Per-module transfer** (sequential across PP stages if PP > 1):
   - FSDP2 all-gather to reconstruct the full parameter on rank 0
   - QLoRA dequantize + LoRA merge
   - optional FP8 requantization
   - NCCL broadcast to all inference ranks
4. **Resume inference**: signal inference servers to resume
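The four steps above can be sketched end to end. Every function here is a stub standing in for the real handler’s calls (the names are hypothetical); only the ordering is meant to be accurate:

```python
CALLS = []  # records the order of pipeline stages for illustration

def health_check(endpoints):
    CALLS.append("health_check")

def init_broadcast_group(endpoints):
    CALLS.append("backend_init")
    return "nccl_group"

def prepare_param(shard):
    # stands in for: all-gather, adapter merge, optional FP8 requant
    CALLS.append("prepare")
    return shard

def broadcast(group, name, param):
    CALLS.append(f"broadcast:{name}")

def resume(endpoints):
    CALLS.append("resume_inference")

def sync_weights(endpoints, named_params):
    health_check(endpoints)                  # 1. endpoints reachable?
    group = init_broadcast_group(endpoints)  # 2. NCCL group init
    for name, shard in named_params:         # 3. per-module transfer
        broadcast(group, name, prepare_param(shard))
    resume(endpoints)                        # 4. resume inference
```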
## Multi-PP Support

For PP-sharded training, each PP stage’s parameters are synced independently in sequence. All PP ranks that hold a given layer participate in the broadcast for that layer.
## EP Support

For expert-parallel models, EP ranks send their expert shards directly, without first gathering to rank 0 (`supports_direct_ep_transfer: true`).
## Benchmark: Qwen3-235B on 8 Nodes

| Config | Sync time | Transfer size |
|---|---|---|
| BF16, 64 training GPUs → 8 inference GPUs | ~73 s | ~471 GB |
| FP8, same topology | ~40 s | ~236 GB |

With NVLink: ~22 GB/s effective bandwidth per sync.
## Source

| File | Description |
|---|---|
| `src/xorl/server/weight_sync/handler.py` | `WeightSyncHandler` — orchestrates the full sync pipeline |
| `src/xorl/server/weight_sync/backends/` | Transport backend implementations |