Server Training Config

The server config is a flat YAML file: all fields sit at the top level with no nesting. It is passed to the launcher:

```shell
python -m xorl.server.launcher --mode auto --config config.yaml
```

Any field can be overridden on the command line with `--server.key value` or `--server.key=value`:

```shell
python -m xorl.server.launcher --mode auto --config config.yaml \
  --server.pipeline_parallel_size 2 \
  --server.expert_parallel_size 4 \
  --server.output_dir /shared/outputs \
  --server.log_level DEBUG
```

| Field | Default | Description |
| --- | --- | --- |
| `model_path` | required | HF Hub ID or local path to model weights. |
| `model_name` | same as `model_path` | Model identifier for validation. |
| `config_path` | same as `model_path` | Path to model config. |
| `tokenizer_path` | same as `config_path` | Path to tokenizer. |
| `attn_implementation` | `flash_attention_3` | Attention backend: `eager`, `sdpa`, `native` (PyTorch SDPA + cuDNN, no extra dependencies, Hopper + Blackwell), `flash_attention_3` (FA3, Hopper), `flash_attention_4` (FA4 CUTE, Hopper + Blackwell). |
| `moe_implementation` | `null` | MoE kernel: `null` (auto), `eager`, `triton`, `native`, `quack`. |
| `ep_dispatch` | `alltoall` | Expert-parallel dispatch: `alltoall` or `deepep` (NVLink-optimized). |
| `deepep_buffer_size_gb` | `2.0` | DeepEP NVLink buffer size per GPU in GB. Only active when `ep_dispatch: deepep`. |
| `deepep_num_sms` | `20` | SMs assigned to DeepEP communication kernels. Must be even. |
| `deepep_async_combine` | `false` | Overlap DeepEP combine with the next layer's compute (experimental). |
| `merge_qkv` | `true` | Keep Q/K/V projections fused. Set `false` for tensor parallelism. |
| `basic_modules` | `[]` | Additional module names to shard as separate FSDP units. |
| `foundation` | `{}` | Extra config for the foundation model (dict). |
| `encoders` | `{}` | Multimodal encoder configs, keyed by type (`image`, `video`, `audio`). |
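For example, a flat config enabling DeepEP dispatch could look like the sketch below; the model ID is an illustrative placeholder, and the buffer/SM values are assumptions rather than tuned recommendations:

```yaml
# config.yaml — flat, no nesting
model_path: org/some-moe-model   # hypothetical HF Hub ID; a local path also works
attn_implementation: flash_attention_3
ep_dispatch: deepep              # NVLink-optimized expert-parallel dispatch
deepep_buffer_size_gb: 4.0       # only read when ep_dispatch: deepep
deepep_num_sms: 24               # must be even
```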

These flags align the training model’s numerics with the inference engine (SGLang) to avoid train/inference mismatch.

| Field | Default | Description |
| --- | --- | --- |
| `router_fp32` | `true` | Upcast MoE router gate logits to float32 for numerical stability. |
| `lm_head_fp32` | `true` | Upcast LM head logits to float32. |
| `rmsnorm_native` | `false` | Use unfused PyTorch RMSNorm instead of the Triton kernel. |
| `activation_native` | `false` | Use unfused SiLU instead of the fused Triton kernel. |
| `rope_native` | `false` | Use unfused RoPE instead of the flash_attn kernel. |
| `attention_cast_bf16` | `false` | Explicitly cast Q/K to BF16 after RoPE. |
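When chasing a train/inference numerics gap, one plausible approach is to flip the fused kernels to their native counterparts one at a time while keeping the fp32 upcasts at their defaults, e.g.:

```yaml
# Rule out fused kernels as the source of a numerics mismatch
router_fp32: true       # default
lm_head_fp32: true      # default
rmsnorm_native: true    # unfused PyTorch RMSNorm instead of Triton
rope_native: true       # unfused RoPE instead of flash_attn
```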

These fields control the parallelism layout:

| Field | Default | Description |
| --- | --- | --- |
| `data_parallel_mode` | `fsdp2` | Data parallelism: `none`, `ddp`, `fsdp2` (ZeRO-3). |
| `data_parallel_shard_size` | `1` | Number of GPUs per FSDP shard group. |
| `data_parallel_replicate_size` | `1` | Number of data replicas for HSDP. |
| `tensor_parallel_size` | `1` | TP degree. |
| `pipeline_parallel_size` | `1` | Number of PP stages. |
| `pipeline_parallel_schedule` | `1F1B` | PP schedule: `1F1B` or `GPipe`. |
| `pp_variable_seq_lengths` | `true` | Dynamically negotiate the max sequence length per PP step via all-reduce. |
| `expert_parallel_size` | `1` | EP degree for MoE models. |
| `ulysses_parallel_size` | `1` | Ulysses context parallelism degree. |
| `ringattn_parallel_size` | `1` | Ring Attention degree. |
| `cp_fsdp_mode` | `all` | SP + FSDP interaction: `all`, `ulysses_only`, `ring_only`, `none`. |
| `reshard_after_forward` | `true` | Reshard FSDP2 parameters after forward. |
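As an illustration, a 16-GPU MoE run might combine pipeline stages with FSDP2 sharding and expert parallelism. How the degrees compose into the world size depends on the launcher, so treat the arithmetic in the comments (2 PP stages × 8-way shards, with EP carved out of the data-parallel group) as an assumption, not documented behavior:

```yaml
data_parallel_mode: fsdp2
data_parallel_shard_size: 8      # 8-way FSDP shard group (assumed)
pipeline_parallel_size: 2        # 2 stages × 8 shards = 16 GPUs (assumed composition)
pipeline_parallel_schedule: 1F1B
expert_parallel_size: 4          # EP degree for the MoE layers
```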

General training options:

| Field | Default | Description |
| --- | --- | --- |
| `seed` | `42` | Random seed. |
| `enable_mixed_precision` | `true` | BF16 mixed-precision training. |
| `enable_gradient_checkpointing` | `true` | Activation recomputation to reduce memory. |
| `enable_full_shard` | `true` | FSDP2 full parameter sharding (ZeRO-3). |
| `enable_activation_offload` | `false` | Offload activations to CPU. |
| `enable_compile` | `false` | `torch.compile` for the forward pass. |
| `enable_reentrant` | `false` | Use reentrant gradient checkpointing. |
| `enable_forward_prefetch` | `false` | FSDP forward prefetch. |
| `init_device` | `meta` | Model initialization device: `cpu`, `meta`, `cuda`. |
| `load_weights_mode` | `auto` | Weight loading: `auto`, `safetensors`, `dcp`. |
| `ce_mode` | `compiled` | Cross-entropy implementation: `compiled` (recommended, `torch.compile`) or `eager` (may OOM at 32K+ sequence length). |
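For long-sequence runs, a plausible memory-oriented combination is checkpointing plus activation offload plus the compiled cross-entropy (eager may OOM at 32K+). Whether the offload cost is worth it is workload-dependent:

```yaml
enable_gradient_checkpointing: true
enable_activation_offload: true   # trade speed for memory on very long sequences
ce_mode: compiled                 # eager may OOM at 32K+ sequence length
sample_packing_sequence_len: 32000
```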

Optimizer settings:

| Field | Default | Description |
| --- | --- | --- |
| `optimizer` | `adamw` | Optimizer: `adamw`, `anyprecision_adamw`, `sgd`, `muon`. |
| `optimizer_dtype` | `bf16` | Dtype for optimizer states: `fp32` or `bf16`. BF16 halves optimizer memory. |
| `muon_lr` | `0.02` | Learning rate for Muon matrix parameter groups. Only used when `optimizer: muon`. |
| `muon_momentum` | `0.95` | Muon momentum coefficient. |
| `muon_nesterov` | `true` | Use Nesterov momentum in Muon. |
| `muon_ns_steps` | `5` | Newton-Schulz iterations for Muon orthogonalization. |
| `muon_adjust_lr_fn` | `null` | Muon LR scaling: `original` or `match_rms_adamw`. |
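A Muon setup might look like the following; the learning rate shown is simply the documented default, not a tuned value, and choosing `match_rms_adamw` over `original` is an illustrative pick:

```yaml
optimizer: muon
optimizer_dtype: bf16            # halves optimizer-state memory
muon_lr: 0.02                    # matrix parameter groups only
muon_momentum: 0.95
muon_nesterov: true
muon_ns_steps: 5                 # Newton-Schulz iterations
muon_adjust_lr_fn: match_rms_adamw
```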

Checkpointing and output:

| Field | Default | Description |
| --- | --- | --- |
| `output_dir` | `outputs` | Output directory for checkpoints and logs. Must be on a shared filesystem for multi-node runs. |
| `ckpt_manager` | `dcp` | Checkpoint format: `dcp` or `torch`. |
| `load_checkpoint_path` | `""` | Path to a checkpoint to resume from. Empty string = start fresh. |
| `storage_limit` | `10TB` | Max disk usage for `output_dir` (e.g., `10GB`, `500MB`). Saves fail with `StorageLimitError` when exceeded. |
| `idle_session_timeout` | `7200.0` | Seconds before an idle training session is automatically cleaned up. Default: 2 hours. |
| `skip_initial_checkpoint` | `false` | Skip saving the initial checkpoint (`000000`) at startup. |
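Resuming a multi-node run could then look like this; the paths and the step-numbered checkpoint directory are illustrative guesses based on the `000000` naming above:

```yaml
output_dir: /shared/outputs                   # must be on a shared filesystem for multi-node
ckpt_manager: dcp
load_checkpoint_path: /shared/outputs/000400  # hypothetical step directory; "" starts fresh
storage_limit: 2TB                            # saves fail with StorageLimitError past this
```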

Training data is sent by the client at runtime. These fields control how the server processes it:

| Field | Default | Description |
| --- | --- | --- |
| `sample_packing_sequence_len` | `32000` | Maximum packed sequence length in tokens. |
| `enable_packing` | `true` | Combine multiple samples into a single packed sequence. |

Logging and diagnostics:

| Field | Default | Description |
| --- | --- | --- |
| `log_level` | `INFO` | Log verbosity: `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
| `enable_self_test` | `false` | Run a self-test forward/backward pass after model initialization. |
| `log_gradient_norms` | `true` | Log per-layer-type gradient norms after each backward pass. |
| `log_router_stats` | `true` | Log MoE router token distribution statistics. |

These fields configure ZMQ communication between the launcher, the workers, and the API server.

| Field | Default | Description |
| --- | --- | --- |
| `worker_bind_host` | `0.0.0.0` | Host for the rank-0 worker's ZMQ ROUTER socket. Use `0.0.0.0` for multi-node to accept all interfaces. |
| `worker_bind_port` | `5556` | Port for the rank-0 worker's ZMQ socket. |
| `engine_connect_host` | `null` | Host for the engine to connect to rank-0. `null` = auto (localhost for single-node, file-based for multi-node). |
| `worker_bind_address` | `auto` | Full ZMQ address (`tcp://host:port`). `auto` = pick a free port. |
| `worker_connection_timeout` | `120.0` | Timeout in seconds for the worker-engine connection. Increase for slow multi-node setups. |
| `worker_max_retries` | `3` | Max retries for failed worker operations. |
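A multi-node setup would plausibly bind rank 0 on all interfaces and give slow interconnects more headroom; the timeout value here is an arbitrary example, not a recommendation:

```yaml
worker_bind_host: 0.0.0.0        # accept connections on all interfaces
worker_bind_port: 5556
engine_connect_host: null        # auto: file-based discovery for multi-node
worker_connection_timeout: 300.0 # extra headroom for slow multi-node startup
```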

LoRA and QLoRA:

| Field | Default | Description |
| --- | --- | --- |
| `enable_lora` | `false` | Enable LoRA adapters. |
| `lora_rank` | `32` | LoRA rank (r). Default is 32 for the server (vs. 16 for local). |
| `lora_alpha` | `16` | LoRA scaling factor. |
| `lora_target_modules` | `null` | Module names to inject LoRA into. `null` = default for the architecture. |
| `moe_shared_lora` | `false` | Share LoRA weights across all MoE experts. |
| `moe_hybrid_shared_lora` | `false` | Share `lora_A` for gate/up projections and `lora_B` for down projections across experts. |
| `enable_qlora` | `false` | Quantize base weights and train LoRA adapters on top. |
| `quant_format` | `nvfp4` | Quantization format: `nvfp4`, `block_fp8`. |
| `quant_group_size` | `16` | Quantization group size. |
| `qlora_exclude_modules` | `null` | Modules to exclude from quantization (e.g., `[lm_head]`). |
| `merge_lora_interval` | `0` | Merge LoRA into base weights every N steps. `0` = never. |
| `reset_optimizer_on_merge` | `false` | Reset optimizer state after each merge (ReLoRA-style). |
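Putting the adapter fields together, a QLoRA run on an MoE model might read as follows; whether to share adapters across experts is a judgment call, shown here only to illustrate the flags:

```yaml
enable_lora: true
lora_rank: 32
lora_alpha: 16
moe_shared_lora: true            # one adapter shared across all experts
enable_qlora: true
quant_format: nvfp4
quant_group_size: 16
qlora_exclude_modules: [lm_head] # keep the output head unquantized
merge_lora_interval: 0           # never merge into base weights
```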

MoE-specific training:

| Field | Default | Description |
| --- | --- | --- |
| `freeze_router` | `true` | Freeze MoE router weights during training. Recommended for fine-tuning to preserve routing learned during pre-training. |

Weight synchronization with the inference engine:

| Field | Default | Description |
| --- | --- | --- |
| `sync_inference_method` | `nccl_broadcast` | Method for pushing updated weights to the inference endpoint after each step. Currently only `nccl_broadcast` is supported (uses SGLang `update_weights_from_distributed`). |