
Server Training for RL

Server training exposes the training loop as a REST API, enabling external processes to drive gradient updates step by step. This is the primary mode for online RL training.

Online RL requires a training server, an inference server, a client SDK to orchestrate them, and a reward signal. XoRL provides the first three; you bring the reward.

[Figure: XoRL RL architecture]

The RL loop:

  1. Generate rollouts — send prompts to xorl-sglang via xorl_client.SamplingClient. Returns completions + per-token logprobs.
  2. Score — your reward model / verifier / environment scores the rollouts.
  3. Train — pack scored rollouts into Datum objects, call forward_backward() + optim_step() on the xorl training server.
  4. Sync weights — broadcast updated weights to xorl-sglang via NCCL. KV cache is flushed automatically.
  5. Repeat with the updated policy.
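
The five steps above can be sketched as plain control flow. This is a minimal illustration, not XoRL's orchestrator: `Rollout`, `rl_loop`, and the four callables are hypothetical names introduced here, and the stubs stand in for the real sampling, training, and weight-sync calls so the loop can run without a server.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Rollout:
    """One sampled completion with the sampler's per-token logprobs (hypothetical type)."""
    prompt: str
    completion: str
    logprobs: List[float]
    reward: float = 0.0


def rl_loop(
    prompts: Sequence[str],
    generate: Callable[[Sequence[str]], List[Rollout]],  # step 1: rollouts
    score: Callable[[Rollout], float],                    # step 2: your reward
    train: Callable[[List[Rollout]], None],               # step 3: gradient update
    sync_weights: Callable[[], None],                     # step 4: push weights
    num_steps: int,
    sync_interval: int = 1,
) -> List[float]:
    """Run generate -> score -> train -> sync and return the mean reward per step."""
    mean_rewards = []
    for step in range(num_steps):
        rollouts = generate(prompts)
        for rollout in rollouts:
            rollout.reward = score(rollout)
        train(rollouts)
        if step % sync_interval == 0:
            sync_weights()
        total = sum(r.reward for r in rollouts)
        mean_rewards.append(total / max(len(rollouts), 1))
    return mean_rewards


# Stub components so the control flow can be exercised without any server.
train_log: List[int] = []
sync_count = [0]


def fake_generate(prompts: Sequence[str]) -> List[Rollout]:
    return [Rollout(p, "4", [-0.1]) for p in prompts]


def fake_sync() -> None:
    sync_count[0] += 1


history = rl_loop(
    prompts=["What is 2 + 2?"],
    generate=fake_generate,
    score=lambda r: 1.0 if r.completion == "4" else 0.0,
    train=lambda batch: train_log.append(len(batch)),
    sync_weights=fake_sync,
    num_steps=3,
    sync_interval=2,
)
```

In a real run, the stubs would be replaced by calls into the sampling and training clients; keeping the reward function as a plain callable makes it easy to swap a verifier for a reward model.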

| Component | Provided by | Description |
| --- | --- | --- |
| Training server | xorl | Forward/backward, optimizer, checkpointing, parallelism |
| Inference server | xorl-sglang | Rollout generation, per-token logprobs, weight sync |
| Client SDK | xorl-client | Python SDK: TrainingClient, SamplingClient, RestClient |
| Reward / environment | You | Reward model, code sandbox, math verifier, rule-based scorer |
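
Since the reward is the one component you supply, here is a minimal sketch of a rule-based math verifier. It is an assumption-laden example rather than anything shipped with XoRL: `extract_final_number` and `math_reward` are hypothetical helpers, and the rule (take the last number in the completion and compare it to the expected answer) is only one of many possible scoring schemes.

```python
import re
from typing import Optional

# Matches integers and simple decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_final_number(completion: str) -> Optional[str]:
    """Return the last number in the text, ignoring thousands separators."""
    matches = _NUMBER.findall(completion.replace(",", ""))
    return matches[-1] if matches else None


def math_reward(completion: str, expected: str) -> float:
    """Binary reward: 1.0 if the final number equals the expected answer, else 0.0."""
    answer = extract_final_number(completion)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(expected) else 0.0
```

A binary reward like this is easy to audit; a code sandbox or learned reward model would slot into the same position in the loop.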

1. Start the training server:

    CUDA_VISIBLE_DEVICES=0,1,2,3 python -m xorl.server.launcher \
      --mode auto \
      --config examples/server/configs/full/qwen3_8b_full.yaml \
      --api-port 6000

2. Start xorl-sglang inference:

    CUDA_VISIBLE_DEVICES=4 python -m sglang.launch_server \
      --model-path Qwen/Qwen3-8B-FP8 \
      --port 30000 \
      --tp-size 1 \
      --rl-on-policy-target xorl \
      --enable-fp32-router \
      --enable-fp32-lm-head

3. Run training from Python:

    import requests
    import xorl_client

    num_steps = 100    # illustrative value
    sync_interval = 1  # illustrative value: sync weights every N steps

    service = xorl_client.ServiceClient(base_url="http://localhost:6000")
    client = service.create_training_client(base_model="Qwen/Qwen3-8B")

    # Register the inference endpoint for weight sync
    resp = requests.post("http://localhost:6000/add_inference_endpoint", json={
        "host": "localhost",
        "port": 30000,
        "worker_port": 30000,
        "world_size": 1,
    })
    resp.raise_for_status()  # fail early if registration was rejected

    # RL loop
    for step in range(num_steps):
        # ... generate rollouts from sglang, compute rewards, and pack the
        # scored rollouts into `data` (Datum objects) ...
        fwd = client.forward_backward(data, loss_fn="importance_sampling")
        opt = client.optim_step(xorl_client.AdamParams(learning_rate=1e-5))
        # Both calls return futures; wait for each before continuing
        fwd.result()
        opt.result()
        if step % sync_interval == 0:
            client.sync_inference_weights(
                master_address="localhost",
                master_port=29600,
            ).result()

For multi-node setup, server configuration, and launcher CLI options, see Launching & Configuration.

| Config | Model | Mode | GPUs |
| --- | --- | --- | --- |
| full/qwen3_8b_full.yaml | Qwen3-8B | Full-weight bf16 | 4 |
| lora/qwen3_8b_lora.yaml | Qwen3-8B | LoRA rank 32 | 4 |
| qlora/qwen3_8b_qlora_nvfp4.yaml | Qwen3-8B | QLoRA nvfp4 | 4 |
| full/qwen3_coder_30b_a3b_full.yaml | Qwen3-Coder-30B-A3B | Full bf16 (SP=4) | 8 |
| qlora/qwen3_coder_30b_a3b_qlora.yaml | Qwen3-Coder-30B-A3B | QLoRA (EP=4, SP=4) | 4 |
| full/qwen3_235b_a22b_8node_ep64.yaml | Qwen3-235B-A22B | Full bf16 (EP=64) | 64 |

| Training model | Inference model | TP |
| --- | --- | --- |
| Qwen/Qwen3-8B | Qwen/Qwen3-8B-FP8 | 1 |
| Qwen/Qwen3-Coder-30B-A3B-Instruct | Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8 | 2 |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 | 4 |
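
The training/inference pairings above can be turned into a launch command programmatically. The sketch below is a convenience helper, not part of XoRL: `INFERENCE_PAIRINGS` and `sglang_launch_command` are hypothetical names, the flags are copied from the Qwen3-8B launch example earlier on this page, and applying the same flags to the larger models is an assumption.

```python
import shlex
from typing import Dict, List, Tuple

# Training model -> (inference model, tensor-parallel size), copied from the table.
INFERENCE_PAIRINGS: Dict[str, Tuple[str, int]] = {
    "Qwen/Qwen3-8B": ("Qwen/Qwen3-8B-FP8", 1),
    "Qwen/Qwen3-Coder-30B-A3B-Instruct": (
        "Qwen/Qwen3-Coder-30B-A3B-Instruct-FP8", 2),
    "Qwen/Qwen3-235B-A22B-Instruct-2507": (
        "Qwen/Qwen3-235B-A22B-Instruct-2507-FP8", 4),
}


def sglang_launch_command(training_model: str, port: int = 30000) -> List[str]:
    """Build the xorl-sglang launch argv for a given training model."""
    inference_model, tp_size = INFERENCE_PAIRINGS[training_model]
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", inference_model,
        "--port", str(port),
        "--tp-size", str(tp_size),
        # Flags below mirror the single-GPU example; reuse is an assumption.
        "--rl-on-policy-target", "xorl",
        "--enable-fp32-router",
        "--enable-fp32-lm-head",
    ]


# Render as a shell-safe string for copy-pasting into a terminal.
command_line = shlex.join(sglang_launch_command("Qwen/Qwen3-8B"))
print(command_line)
```

Returning an argv list rather than a string keeps the helper usable with `subprocess.run` without shell quoting concerns.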

The xorl-client Python SDK drives the training server — see the Client SDK page for installation, client classes, loss functions, and training loop examples.


Tinker clients can create a usable session with POST /api/v1/create_session. The returned session_id is registered as xorl’s backing model_id, so follow-up requests may send either session_id (Tinker-style) or model_id (xorl-native), and the server normalizes both forms.

| Endpoint | Description |
| --- | --- |
| POST /api/v1/create_session | Create/register a Tinker-compatible session ID |
| POST /api/v1/session_heartbeat | Refresh a session’s idle timeout |
| POST /api/v1/create_model | Create/register a model session with explicit metadata |
| POST /api/v1/unload_model | Unload and release a session |
| POST /api/v1/forward_backward | Forward + backward pass |
| POST /api/v1/optim_step | Optimizer step |
| POST /api/v1/weights_info | Checkpoint metadata for model loading |
| GET /api/v1/training_runs | List training runs |
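
To make the session_id / model_id normalization concrete, the following sketch shows one way a request handler could resolve either field to a single identifier. It illustrates the idea only: `resolve_model_id` is a hypothetical function, and how the actual server treats missing or conflicting fields is not specified here, so the error handling below is an assumption.

```python
from typing import Any, Dict


def resolve_model_id(payload: Dict[str, Any]) -> str:
    """Return one identifier from a Tinker-style or xorl-native request body."""
    session_id = payload.get("session_id")
    model_id = payload.get("model_id")

    # Assumed policy: if both are present they must agree.
    if session_id and model_id and session_id != model_id:
        raise ValueError("session_id and model_id refer to different sessions")

    resolved = model_id or session_id
    if not resolved:
        raise ValueError("request must include session_id or model_id")
    return resolved
```

With this shape, a Tinker-style body `{"session_id": "abc"}` and an xorl-native body `{"model_id": "abc"}` resolve to the same value.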

XoRL’s server training mode is designed for online RL with LLMs. The following papers provide background on the algorithms and system designs that XoRL supports:

| Paper | Description |
| --- | --- |
| Training language models to follow instructions with human feedback (Ouyang et al., 2022) | InstructGPT, the original RLHF pipeline: SFT → reward model → PPO fine-tuning. Established the standard three-stage approach. |
| DeepSeekMath: Pushing the Limits of Mathematical Reasoning (Shao et al., 2024) | Introduces GRPO (Group Relative Policy Optimization), a simpler alternative to PPO that uses group-level advantages without a value model. XoRL supports GRPO via loss_fn="importance_sampling". |
| DAPO: An Open-Source LLM Reinforcement Learning System (Yu et al., 2025) | Decoupled clip ratios, dynamic sampling, token-level policy gradient, and overlong reward shaping. Demonstrates RL scaling without a value model. |
| ReMax: A Simple, Effective, and Efficient Method for Aligning LLMs (Li et al., 2023) | REINFORCE with the reward of the greedy response as its baseline, so no critic is needed. Shows that simpler RL methods can match PPO performance with less compute. |

| Paper | Description |
| --- | --- |
| Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models (Noukhovitch et al., 2024) | Decouples generation and training into async processes. Shows that off-policy RL (training on stale rollouts) works well with proper importance correction, which is the motivation behind XoRL’s TIS (Temporal Importance Sampling). |
| INTELLECT-2: A Reasoning Model Trained Through Globally-Distributed Reinforcement Learning (Prime Intellect, 2025) | Async RL training across globally distributed nodes. Demonstrates that RL training can scale across unreliable, heterogeneous clusters. |
| OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework (Hu et al., 2024) | Ray-based RLHF framework with vLLM integration. Demonstrates the separation of training and inference into distinct processes, the same architecture XoRL uses. |
| verl: Volcano Engine Reinforcement Learning for LLMs (Sheng et al., 2024) | Hybrid pipeline that colocates actor and rollout on the same GPUs with memory sharing. Shows how to minimize GPU idle time in the RL loop. |

| Paper | XoRL feature |
| --- | --- |
| GLM-5: Open Multilingual Multitask Model (Team GLM, 2025) | IcePop: hard gradient masking for extreme importance ratios. Available via the icepop_beta parameter in policy_loss. |
| LoRA: Low-Rank Adaptation of LLMs (Hu et al., 2021) | LoRA and multi-adapter support. XoRL supports multiple concurrent LoRA adapters, switchable per request. |
| QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023) | QLoRA training with nvfp4, block_fp8, and nf4 quantization formats. |
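
Two ideas from these papers are small enough to sketch directly: GRPO's group-relative advantages and hard masking of tokens with extreme importance ratios. This is a plain-Python illustration of the general techniques, not XoRL's loss implementation. All function names are hypothetical, the use of population standard deviation is an assumption, and the `low`/`high` bounds are illustrative parameters whose relation to `icepop_beta` is not specified here.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against the mean and std of its own group."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


def importance_ratios(
    train_logprobs: Sequence[float], sampler_logprobs: Sequence[float]
) -> List[float]:
    """Per-token ratio pi_train / pi_sampler, computed from logprobs."""
    return [math.exp(t - s) for t, s in zip(train_logprobs, sampler_logprobs)]


def mask_extreme_ratios(ratios: Sequence[float], low: float, high: float) -> List[float]:
    """Hard mask: weight 1.0 for tokens inside [low, high], 0.0 otherwise."""
    return [1.0 if low <= r <= high else 0.0 for r in ratios]


# Four completions sampled for the same prompt, scored 1/0 by a verifier.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Second token has drifted far from the sampler's distribution.
ratios = importance_ratios([-1.0, -2.0], [-1.0, -4.0])
token_mask = mask_extreme_ratios(ratios, low=0.5, high=2.0)
```

When every completion in a group receives the same reward, all advantages are zero and the group contributes no gradient, which is the motivation for the dynamic sampling described in the DAPO entry above.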

| Topic | Page |
| --- | --- |
| Server architecture, multi-node, launcher CLI | Launching & Configuration |
| REST API endpoints | API Reference |
| xorl-sglang: weight sync, numerical alignment | Inference: xorl-sglang |
| Client SDK, loss functions, training patterns | Client SDK (xorl-client) |
| NCCL weight sync protocol | Weight Sync |
| SFT fine-tuning example | SFT on No Robots |
| End-to-end weight sync test | Password Memorization |