Server config is a flat YAML file: all fields sit at the top level with no nesting. Pass it to the launcher with:

```shell
python -m xorl.server.launcher --mode auto --config config.yaml
```
Any field can be overridden on the command line with `--server.key value` or `--server.key=value`:

```shell
python -m xorl.server.launcher --mode auto --config config.yaml \
    --server.pipeline_parallel_size 2 \
    --server.expert_parallel_size 4 \
    --server.output_dir /shared/outputs
```
| Field | Default | Description |
|---|---|---|
| `model_path` | required | HF Hub ID or local path to model weights. |
| `model_name` | same as `model_path` | Model identifier for validation. |
| `config_path` | same as `model_path` | Path to model config. |
| `tokenizer_path` | same as `config_path` | Path to tokenizer. |
| `attn_implementation` | `flash_attention_3` | Attention backend: `eager`, `sdpa`, `native` (PyTorch SDPA + cuDNN, no extra deps, Hopper + Blackwell), `flash_attention_3` (FA3, Hopper), `flash_attention_4` (FA4 CUTE, Hopper + Blackwell). |
| `moe_implementation` | `null` | MoE kernel: `null` (auto), `eager`, `triton`, `native`, `quack`. |
| `ep_dispatch` | `alltoall` | Expert-parallel dispatch: `alltoall` or `deepep` (NVLink-optimized). |
| `deepep_buffer_size_gb` | `2.0` | DeepEP NVLink buffer size per GPU in GB. Only active when `ep_dispatch: deepep`. |
| `deepep_num_sms` | `20` | SMs assigned to DeepEP communication kernels. Must be even. |
| `deepep_async_combine` | `false` | Overlap DeepEP combine with the next layer's compute (experimental). |
| `merge_qkv` | `true` | Keep Q/K/V projections fused. Set `false` for tensor parallelism. |
| `basic_modules` | `[]` | Additional module names to shard as separate FSDP units. |
| `foundation` | `{}` | Foundation-model extra config (dict). |
| `encoders` | `{}` | Multimodal encoder configs, keyed by type (`image`, `video`, `audio`). |
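As an illustrative sketch (the model path and buffer sizes below are hypothetical, not defaults), an MoE model on Hopper using DeepEP dispatch might be configured as:

```yaml
# Illustrative values — field names come from the table above.
model_path: /shared/models/my-moe-model   # hypothetical local path
attn_implementation: flash_attention_3    # FA3 on Hopper
moe_implementation: null                  # auto-select the MoE kernel
ep_dispatch: deepep                       # NVLink-optimized dispatch
deepep_buffer_size_gb: 4.0                # raised from the 2.0 default
deepep_num_sms: 24                        # must be even
```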
These flags align the training model’s numerics with the inference engine (SGLang) to avoid train/inference mismatch.
| Field | Default | Description |
|---|---|---|
| `router_fp32` | `true` | Upcast MoE router gate logits to float32 for numerical stability. |
| `lm_head_fp32` | `true` | Upcast LM head logits to float32. |
| `rmsnorm_native` | `false` | Use unfused PyTorch RMSNorm instead of the Triton kernel. |
| `activation_native` | `false` | Use unfused SiLU instead of the fused Triton kernel. |
| `rope_native` | `false` | Use unfused RoPE instead of the flash_attn kernel. |
| `attention_cast_bf16` | `false` | Explicitly cast Q/K to BF16 after RoPE. |
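When debugging a train/inference mismatch, one plausible diagnostic step (an assumption about workflow, not a documented recipe) is to switch every fused kernel to its native counterpart, at a throughput cost, to rule out kernel-level divergence:

```yaml
# Diagnostic sketch: all fused kernels replaced by unfused native ops.
# Slow, but numerically closest to a plain PyTorch reference.
router_fp32: true
lm_head_fp32: true
rmsnorm_native: true
activation_native: true
rope_native: true
attention_cast_bf16: true
```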
| Field | Default | Description |
|---|---|---|
| `data_parallel_mode` | `fsdp2` | Data parallelism: `none`, `ddp`, `fsdp2` (ZeRO-3). |
| `data_parallel_shard_size` | `1` | Number of GPUs per FSDP shard group. |
| `data_parallel_replicate_size` | `1` | Number of data replicas for HSDP. |
| `tensor_parallel_size` | `1` | TP degree. |
| `pipeline_parallel_size` | `1` | PP stages. |
| `pipeline_parallel_schedule` | `1F1B` | PP schedule: `1F1B` or `GPipe`. |
| `pp_variable_seq_lengths` | `true` | Dynamically negotiate max sequence length per PP step via all-reduce. |
| `expert_parallel_size` | `1` | EP degree for MoE models. |
| `ulysses_parallel_size` | `1` | Ulysses context-parallelism degree. |
| `ringattn_parallel_size` | `1` | Ring Attention degree. |
| `cp_fsdp_mode` | `all` | SP+FSDP interaction: `all`, `ulysses_only`, `ring_only`, `none`. |
| `reshard_after_forward` | `true` | Reshard FSDP2 parameters after forward. |
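As a sketch, a 16-GPU layout combining FSDP sharding with tensor and pipeline parallelism might look like this (assuming, as is typical for such frameworks, that the shard, replicate, TP, and PP degrees multiply to the world size — verify against your deployment):

```yaml
# 2 (pp) x 2 (tp) x 4 (shard) = 16 GPUs — illustrative, not defaults.
data_parallel_mode: fsdp2
data_parallel_shard_size: 4
data_parallel_replicate_size: 1
tensor_parallel_size: 2
pipeline_parallel_size: 2
pipeline_parallel_schedule: 1F1B
```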
| Field | Default | Description |
|---|---|---|
| `seed` | `42` | Random seed. |
| `enable_mixed_precision` | `true` | BF16 mixed-precision training. |
| `enable_gradient_checkpointing` | `true` | Activation recomputation to reduce memory. |
| `enable_full_shard` | `true` | FSDP2 full parameter sharding (ZeRO-3). |
| `enable_activation_offload` | `false` | Offload activations to CPU. |
| `enable_compile` | `false` | `torch.compile` for the forward pass. |
| `enable_reentrant` | `false` | Use reentrant gradient checkpointing. |
| `enable_forward_prefetch` | `false` | FSDP forward prefetch. |
| `init_device` | `meta` | Model initialization device: `cpu`, `meta`, `cuda`. |
| `load_weights_mode` | `auto` | Weight loading: `auto`, `safetensors`, `dcp`. |
| `ce_mode` | `compiled` | Cross-entropy implementation: `compiled` (recommended, `torch.compile`) or `eager` (may OOM at 32K+ sequence length). |
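For long sequences on memory-constrained GPUs, one plausible combination of these flags trades compute for memory (whether they compose cleanly with your parallelism settings should be verified empirically):

```yaml
# Memory-saving sketch — flags from the table above.
enable_gradient_checkpointing: true
enable_activation_offload: true   # spill activations to CPU
ce_mode: compiled                 # avoids the eager CE OOM at 32K+ tokens
init_device: meta                 # avoid materializing weights before sharding
```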
| Field | Default | Description |
|---|---|---|
| `optimizer` | `adamw` | Optimizer: `adamw`, `anyprecision_adamw`, `sgd`, `muon`. |
| `optimizer_dtype` | `bf16` | Dtype for optimizer states: `fp32` or `bf16`. BF16 halves optimizer memory. |
| `muon_lr` | `0.02` | Learning rate for Muon matrix parameter groups. Only used when `optimizer: muon`. |
| `muon_momentum` | `0.95` | Muon momentum coefficient. |
| `muon_nesterov` | `true` | Use Nesterov momentum in Muon. |
| `muon_ns_steps` | `5` | Newton-Schulz iterations for Muon orthogonalization. |
| `muon_adjust_lr_fn` | `null` | Muon LR scaling: `original` or `match_rms_adamw`. |
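A minimal Muon setup, assuming (per the table above) that `muon_lr` applies only to the matrix parameter groups, could look like:

```yaml
optimizer: muon
optimizer_dtype: bf16              # halves optimizer-state memory
muon_lr: 0.02                      # Muon matrix parameter groups only
muon_momentum: 0.95
muon_nesterov: true
muon_ns_steps: 5
muon_adjust_lr_fn: match_rms_adamw # scale updates to AdamW-like RMS
```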
| Field | Default | Description |
|---|---|---|
| `output_dir` | `outputs` | Output directory for checkpoints and logs. Must be on a shared filesystem for multi-node runs. |
| `ckpt_manager` | `dcp` | Checkpoint format: `dcp` or `torch`. |
| `load_checkpoint_path` | `""` | Path to a checkpoint to resume from. Empty string = start fresh. |
| `storage_limit` | `10TB` | Max disk usage for `output_dir` (e.g., `10GB`, `500MB`). Saves fail with `StorageLimitError` when exceeded. |
| `idle_session_timeout` | `7200.0` | Seconds before an idle training session is automatically cleaned up. Default: 2 hours. |
| `skip_initial_checkpoint` | `false` | Skip saving the initial checkpoint (`000000`) at startup. |
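A resume setup on a shared filesystem might look like the following sketch (both paths and the checkpoint directory layout are hypothetical):

```yaml
output_dir: /shared/outputs/run-017        # hypothetical shared path
ckpt_manager: dcp
load_checkpoint_path: /shared/outputs/run-016   # hypothetical resume source
storage_limit: 2TB                         # saves raise StorageLimitError past this
skip_initial_checkpoint: true              # don't re-save step 000000 on resume
```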
Training data is sent by the client at runtime. These fields control how the server processes it:
| Field | Default | Description |
|---|---|---|
| `sample_packing_sequence_len` | `32000` | Maximum packed sequence length in tokens. |
| `enable_packing` | `true` | Combine multiple samples into a single packed sequence. |
| Field | Default | Description |
|---|---|---|
| `log_level` | `INFO` | Log verbosity: `DEBUG`, `INFO`, `WARNING`, `ERROR`. |
| `enable_self_test` | `false` | Run a self-test forward/backward pass after model initialization. |
| `log_gradient_norms` | `true` | Log per-layer-type gradient norms after each backward pass. |
| `log_router_stats` | `true` | Log MoE router token-distribution statistics. |
ZMQ communication between the launcher, workers, and API server.
| Field | Default | Description |
|---|---|---|
| `worker_bind_host` | `0.0.0.0` | Host for the rank-0 worker's ZMQ ROUTER socket. Use `0.0.0.0` for multi-node to accept connections on all interfaces. |
| `worker_bind_port` | `5556` | Port for the rank-0 worker's ZMQ socket. |
| `engine_connect_host` | `null` | Host the engine connects to for rank-0. `null` = auto (localhost for single-node, file-based discovery for multi-node). |
| `worker_bind_address` | `auto` | Full ZMQ address (`tcp://host:port`). `auto` = pick a free port. |
| `worker_connection_timeout` | `120.0` | Timeout in seconds for the worker-engine connection. Increase for slow multi-node setups. |
| `worker_max_retries` | `3` | Max retries for failed worker operations. |
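For a multi-node run over a slow interconnect, a plausible sketch (the timeout value is illustrative) is to bind on all interfaces and extend the handshake window:

```yaml
worker_bind_host: 0.0.0.0         # accept connections from other nodes
worker_bind_port: 5556
engine_connect_host: null         # auto: file-based discovery for multi-node
worker_connection_timeout: 300.0  # raised from the 120 s default
```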
| Field | Default | Description |
|---|---|---|
| `enable_lora` | `false` | Enable LoRA adapters. |
| `lora_rank` | `32` | LoRA rank (r). Default is 32 for the server (vs. 16 for local). |
| `lora_alpha` | `16` | LoRA scaling factor. |
| `lora_target_modules` | `null` | Module names to inject LoRA into. `null` = default for the architecture. |
| `moe_shared_lora` | `false` | Share LoRA weights across all MoE experts. |
| `moe_hybrid_shared_lora` | `false` | Share `lora_A` for gate/up projections and `lora_B` for down projections across experts. |
| `enable_qlora` | `false` | Quantize base weights and train LoRA adapters on top. |
| `quant_format` | `nvfp4` | Quantization format: `nvfp4`, `block_fp8`. |
| `quant_group_size` | `16` | Quantization group size. |
| `qlora_exclude_modules` | `null` | Modules to exclude from quantization (e.g., `[lm_head]`). |
| `merge_lora_interval` | `0` | Merge LoRA into base weights every N steps. `0` = never. |
| `reset_optimizer_on_merge` | `false` | ReLoRA-style optimizer reset after each merge. |
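A QLoRA sketch for an MoE model, quantizing the base weights to NVFP4 while sharing one adapter across experts (values illustrative, not recommendations):

```yaml
enable_lora: true
lora_rank: 32
lora_alpha: 16
enable_qlora: true
quant_format: nvfp4
quant_group_size: 16
qlora_exclude_modules: [lm_head]   # keep the LM head unquantized
moe_shared_lora: true              # one LoRA shared across all experts
merge_lora_interval: 0             # never merge into base weights
```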
| Field | Default | Description |
|---|---|---|
| `freeze_router` | `true` | Freeze MoE router weights during training. Recommended for fine-tuning to preserve the routing learned during pre-training. |
| Field | Default | Description |
|---|---|---|
| `sync_inference_method` | `nccl_broadcast` | Method for pushing updated weights to the inference endpoint after each step. Currently only `nccl_broadcast` is supported (uses SGLang `update_weights_from_distributed`). |