
Weight Sync

Weight sync transfers the current training model weights to one or more inference servers (e.g. SGLang) running on separate GPUs. This is the key mechanism for RL training loops where the policy model must be kept in sync between the training and inference sides.

Training GPUs (xorl server)
│ NCCL broadcast
Inference GPUs (SGLang)

xorl’s training server broadcasts weights to registered inference endpoints using NCCL. The broadcast group is initialized on the first sync call and reused for subsequent syncs.

[Diagram] Training GPUs (FSDP2 shards 0 .. N-1): rank 0 gathers each shard, dequantizes + merges LoRA (optional FP8), then NCCL-broadcasts in a bucket loop to the inference GPUs (SGLang TP replicas). Sequential per-module: FSDP all-gather → dequant/merge LoRA → optional FP8 → broadcast.

Before syncing, register the inference server’s address with xorl:

POST /api/v1/add_inference_endpoint
{"host": "inference-node-01", "port": 30000, "world_size": 8}

Multiple endpoints can be registered (e.g. for multi-replica inference):

for host, port in inference_servers:
    requests.post(f"{base_url}/api/v1/add_inference_endpoint", json={
        "host": host, "port": port, "world_size": 8,
    })

Endpoints are deduplicated by (host, port) — registering the same endpoint twice is safe.
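A minimal sketch of that dedup behavior (a hypothetical class, not xorl's actual registry) keys the stored endpoints on the (host, port) pair, so re-registration simply overwrites:

```python
class EndpointRegistry:
    """Illustrative sketch: endpoints deduplicated by (host, port)."""

    def __init__(self):
        self._endpoints = {}

    def add(self, host, port, world_size):
        # Re-registering the same (host, port) replaces the old entry.
        self._endpoints[(host, port)] = {
            "host": host, "port": port, "world_size": world_size,
        }

    def all(self):
        return list(self._endpoints.values())

registry = EndpointRegistry()
registry.add("inference-node-01", 30000, 8)
registry.add("inference-node-01", 30000, 8)  # duplicate: safely absorbed
```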

POST /api/v1/sync_inference_weights
POST /sync_inference_weights # shorthand alias
{
  "master_address": "training-node-01",
  "master_port": 29600,
  "group_name": "sync_group_0",
  "buffer_size_mb": 512
}
Field            Description
master_address   IP of the training head node (used for NCCL rendezvous)
master_port      Free port for the NCCL sync group initialization
group_name       Unique name for the NCCL process group (use different names for different inference clusters)
buffer_size_mb   NCCL broadcast buffer size per chunk
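A small helper for assembling that request body might look like this (the function name and defaults are illustrative, not part of xorl's API):

```python
def sync_request_body(master_address, master_port,
                      group_name="sync_group_0", buffer_size_mb=512):
    """Build the JSON body for POST /api/v1/sync_inference_weights."""
    return {
        "master_address": master_address,
        "master_port": master_port,
        "group_name": group_name,
        "buffer_size_mb": buffer_size_mb,
    }

body = sync_request_body("training-node-01", 29600)
# requests.post(f"{base_url}/api/v1/sync_inference_weights", json=body)
```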

The sync is synchronous from the caller’s perspective — the endpoint returns once all inference servers have received the weights.

# Every N RL steps, sync weights to inference
if rl_step % sync_every == 0:
    training_client.sync_weights(
        master_address=TRAINING_HEAD_IP,
        master_port=29600,
        group_name="policy_sync",
    )
    # Now SGLang is serving the latest policy weights

Optionally quantize weights during the sync to reduce transfer bandwidth and match inference precision:

POST /api/v1/set_sync_quantization
{
  "quantization": "fp8",  // "bf16" (no quant) or "fp8"
  "skip_modules": ["lm_head", "embed_tokens"]
}

With FP8 quantization, weights are quantized (BF16 → FP8 E4M3) on the training side before broadcasting, then dequantized on the inference side. This halves the sync bandwidth but introduces a small precision loss.
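The round trip can be emulated in NumPy to get a feel for the precision loss. This sketch rounds to a 3-bit mantissa and clips at E4M3's maximum normal value of 448, assuming per-tensor scaling; subnormal and NaN handling are omitted, and none of these function names come from xorl:

```python
import numpy as np

def e4m3_round(x):
    # Round to the nearest value representable with a 3-bit mantissa,
    # then clip to E4M3's max normal value of 448.
    m, e = np.frexp(x)         # x = m * 2**e, with 0.5 <= |m| < 1
    m = np.round(m * 16) / 16  # keep 1 implicit + 3 explicit mantissa bits
    return np.clip(np.ldexp(m, e), -448.0, 448.0)

def quantize_fp8(w):
    scale = np.abs(w).max() / 448.0  # per-tensor scale: max maps to 448
    return e4m3_round(w / scale), scale

def dequantize_fp8(q, scale):
    return q * scale
```

With 3 mantissa bits the worst-case relative rounding error is 2^-4 (about 6%), which is the "small precision loss" referred to above.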

For LoRA training, the sync merges LoRA weights into the base model before broadcasting:

  • W_full = W_base + (lora_B @ lora_A) * scaling
  • The merged BF16 weight is sent to inference

The base weights on the training side are not modified — LoRA parameters remain separate for continued training.
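In NumPy, the merge reduces to one matmul and an add; the shapes and scaling factor below are illustrative, not xorl defaults:

```python
import numpy as np

d_out, d_in, rank = 64, 32, 8
scaling = 2.0  # typically lora_alpha / rank

W_base = np.random.randn(d_out, d_in).astype(np.float32)
lora_A = np.random.randn(rank, d_in).astype(np.float32)
lora_B = np.random.randn(d_out, rank).astype(np.float32)

# The merged result is a new array for transfer; W_base itself stays
# untouched so LoRA training can continue with separate adapters.
W_full = W_base + (lora_B @ lora_A) * scaling
```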

For QLoRA, the sync dequantizes and merges:

  • W_full = dequant(W_packed) + correction_U @ correction_B + (lora_B @ lora_A) * scaling
  • Optionally re-quantizes to FP8 for transfer
To deregister an inference server, remove its endpoint:

POST /api/v1/remove_inference_endpoint
{"host": "inference-node-01", "port": 30000}

When the training server needs all GPU memory (e.g. for a large training step), put inference servers to sleep:

# Free inference GPU memory
requests.post(f"{inference_url}/sleep")

# Do training
for _ in range(n_steps):
    training_client.forward_backward(...)
    training_client.optim_step(...)

# Sync and resume inference
training_client.sync_weights(...)
requests.post(f"{inference_url}/wake_up")

The sync handler (server/weight_sync/handler.py) proceeds in order:

  1. Health check: Verify all inference endpoints are reachable
  2. Backend init: Initialize NCCL broadcast group between training + inference ranks
  3. Per-module transfer (sequential across PP stages if PP > 1):
    • FSDP2 all-gather to reconstruct full parameter on rank 0
    • QLoRA dequantize + LoRA merge
    • Optional FP8 requantization
    • NCCL broadcast to all inference ranks
  4. Resume inference: Signal inference servers to resume
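The ordering above can be sketched as a simple loop. Every helper here is an illustrative stand-in (each one just records its name so the sequence is visible), not the actual handler.py API:

```python
log = []

def health_check(endpoints):
    # 1. Every inference endpoint must answer before NCCL init.
    log.append("health_check")

def init_broadcast_group():
    # 2. One-time NCCL group setup between training + inference ranks.
    log.append("backend_init")

def transfer_module(name):
    # 3. Per module: all-gather -> dequant/merge -> optional FP8 -> broadcast.
    log.append(f"transfer:{name}")

def resume_inference():
    # 4. Tell inference servers to resume serving.
    log.append("resume")

def run_sync(endpoints, modules):
    health_check(endpoints)
    init_broadcast_group()
    for m in modules:  # sequential per-module loop
        transfer_module(m)
    resume_inference()

run_sync(["infer-0"], ["embed", "layer_0", "lm_head"])
```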

For PP-sharded training, each PP stage’s parameters are synced independently in sequence. All PP ranks that hold a given layer participate in the broadcast for that layer.

Expert-parallel models: EP ranks send their expert shards directly without gathering to rank 0 first (supports_direct_ep_transfer: true).

Config                                       Sync time   Transfer size
BF16, 64 training GPUs → 8 inference GPUs    ~73 s       ~471 GB
FP8, same topology                           ~40 s       ~236 GB

With NVLink: ~22 GB/s effective bandwidth per sync.

File                                      Description
src/xorl/server/weight_sync/handler.py    WeightSyncHandler — orchestrates the full sync pipeline
src/xorl/server/weight_sync/backends/     Transport backend implementations