# Local Training
Local training uses torchrun to launch distributed processes directly, without an API server. It is best for offline supervised fine-tuning where the full training loop is fixed ahead of time.
## Launch Command
```sh
torchrun \
  --nproc_per_node <GPUS_PER_NODE> \
  [--nnodes <NUM_NODES>] \
  [--node_rank <NODE_RANK>] \
  [--master_addr <HEAD_NODE_IP>] \
  [--master_port <PORT>] \
  -m xorl.cli.train <config.yaml> [--key.path value ...]
```

### Single Node
```sh
torchrun --nproc_per_node=8 -m xorl.cli.train config.yaml
```

### Multi-Node (2 Nodes, 16 GPUs)
On the head node:
```sh
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
  --master_addr=HEAD_IP --master_port=29500 \
  -m xorl.cli.train config.yaml
```

On each worker node:
```sh
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=<RANK> \
  --master_addr=HEAD_IP --master_port=29500 \
  -m xorl.cli.train config.yaml
```

## Config File Structure
Every training config has three top-level sections: `model`, `data`, and `train`. A `lora` section is added when using LoRA or QLoRA.
```yaml
model:
  model_path: Qwen/Qwen3-8B              # HF Hub ID or local path
  attn_implementation: flash_attention_3
  # moe_implementation: triton           # for MoE models
```
```yaml
data:
  datasets:
    - path: /data/train.jsonl
      type: tokenized
      max_seq_len: 4096
      select_columns: [input_ids, labels]
  sample_packing_method: sequential
  sample_packing_sequence_len: 4096
```
```yaml
train:
  output_dir: outputs/my_run
  data_parallel_mode: fsdp2
  micro_batch_size: 1
  gradient_accumulation_steps: 4
  num_train_epochs: 1
  optimizer: adamw
  lr: 1e-5
  lr_warmup_ratio: 0.05
  lr_decay_style: cosine
  weight_decay: 0.01
  max_grad_norm: 1.0
  enable_mixed_precision: true
  enable_gradient_checkpointing: true
  enable_full_shard: true
  init_device: meta
  load_weights_mode: broadcast
  save_steps: 500
  ckpt_manager: dcp
```

## Model Loading
### Standard Loading
```yaml
model:
  model_path: Qwen/Qwen3-8B
```

xorl downloads from the HF Hub on first use (cached in `~/.cache/huggingface`). Pass a local path to avoid re-downloading.
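The local-vs-Hub distinction comes down to whether `model_path` names an existing directory. A minimal sketch of that resolution logic (the helper name is hypothetical; the real loader lives inside xorl):

```python
import os

def resolve_model_path(model_path: str) -> str:
    """Hypothetical sketch: an existing directory is treated as a local
    checkpoint and used directly; anything else is assumed to be a
    HF Hub ID that will be downloaded and cached on first use."""
    if os.path.isdir(model_path):
        return os.path.abspath(model_path)  # local path: nothing to download
    return model_path  # Hub ID, e.g. "Qwen/Qwen3-8B"

print(resolve_model_path("Qwen/Qwen3-8B"))  # Qwen/Qwen3-8B
```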
### Meta Device (recommended for large models)
```yaml
train:
  init_device: meta
  load_weights_mode: broadcast  # rank 0 reads, all ranks receive via NCCL
```

Meta device creates model parameters as zero-cost placeholders. FSDP2 then loads and shards weights directly into each rank's shard, so no rank ever holds the full model in memory. Without meta device, every rank would briefly allocate the entire model before sharding, causing OOM for models larger than ~30B parameters on 80 GB GPUs. Use `load_weights_mode: all_ranks` when each node has fast NVMe-local storage (each rank reads the weights independently).
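The arithmetic behind the OOM claim, as a rough sketch counting bf16 weights only (2 bytes/param, ignoring optimizer state and activations; the helper is illustrative, not part of xorl):

```python
def peak_weight_mem_gb(n_params_b: float, world_size: int, meta_init: bool) -> float:
    """Rough per-rank weight-memory estimate in GB for bf16 weights.
    Without meta init, each rank briefly materializes the full model;
    with meta init, FSDP2 only ever allocates this rank's 1/world_size shard."""
    full_gb = n_params_b * 1e9 * 2 / 1e9  # GB for the whole model in bf16
    return full_gb / world_size if meta_init else full_gb

# A 70B model on one 8-GPU node:
print(peak_weight_mem_gb(70, 8, meta_init=False))  # 140.0 GB: OOM on an 80 GB GPU
print(peak_weight_mem_gb(70, 8, meta_init=True))   # 17.5 GB shard per rank
```

Even a ~35B model already exceeds 80 GB once optimizer state is added, which is why meta init is the recommended default for anything large.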
## Data Parallel Modes
| Mode | Description | When to use |
|---|---|---|
| `ddp` | DistributedDataParallel | Small models, quick experiments |
| `fsdp2` | FSDP2 (ZeRO-3 equivalent) | Large models, default choice |
| `none` | Single process | Debugging |
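Whichever mode is chosen, the global batch size per optimizer step follows from the `train` settings. A sketch with names mirroring the config keys (for `ddp` and `fsdp2`, the number of data-parallel ranks is the torchrun world size):

```python
def global_batch_size(micro_batch_size: int,
                      gradient_accumulation_steps: int,
                      dp_ranks: int) -> int:
    """Samples consumed per optimizer step: each data-parallel rank runs
    gradient_accumulation_steps micro-batches before the weight update."""
    return micro_batch_size * gradient_accumulation_steps * dp_ranks

# The example config (micro_batch_size: 1, gradient_accumulation_steps: 4)
# on a single 8-GPU node:
print(global_batch_size(1, 4, 8))  # 32
```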
### FSDP2 Sharding
```yaml
train:
  data_parallel_mode: fsdp2
  enable_full_shard: true           # shard params + grads + optimizer states
  data_parallel_shard_size: 8       # GPUs per shard group (default: world_size)
  data_parallel_replicate_size: 1   # data replicas (HSDP when > 1)
```

**HSDP (Hybrid Sharding):** for multi-node runs, shard within each node and replicate across nodes:
```yaml
data_parallel_shard_size: 8       # shard within each 8-GPU node
data_parallel_replicate_size: 4   # 4 node replicas (4 x 8 = 32 GPUs total)
```

## Sequence Length and Packing
xorl packs multiple short sequences into a single training bin to maximize GPU utilization.
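A minimal sketch of what the `sequential` strategy does, assuming it is a greedy, order-preserving bin fill (the real implementation lives in `src/xorl/data/prepare/`):

```python
def pack_sequential(seq_lens: list[int], bin_len: int) -> list[list[int]]:
    """Greedy sequential packing sketch: walk the dataset in order,
    appending sequences to the current bin until the next one would
    overflow bin_len, then start a new bin. Returns the sequence
    lengths placed in each packed bin."""
    bins, cur, cur_len = [], [], 0
    for n in seq_lens:
        n = min(n, bin_len)  # a single sequence never exceeds the bin
        if cur and cur_len + n > bin_len:
            bins.append(cur)
            cur, cur_len = [], 0
        cur.append(n)
        cur_len += n
    if cur:
        bins.append(cur)
    return bins

print(pack_sequential([3000, 4000, 1000, 2000], bin_len=8192))
# [[3000, 4000, 1000], [2000]]
```

`multipack` trades this single ordered pass for a smarter assignment that wastes fewer padding tokens per bin, at extra preprocessing cost.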
```yaml
data:
  sample_packing_method: sequential   # or multipack (better packing, slower)
  sample_packing_sequence_len: 8192   # target packed length per step
```

Each dataset entry's `max_seq_len` truncates individual samples before packing:
```yaml
datasets:
  - path: data.jsonl
    type: tokenized
    max_seq_len: 4096   # truncate individual samples to 4096 before packing
```

## Optimizer Options
### AdamW (default)
```yaml
train:
  optimizer: adamw
  lr: 1e-5
  weight_decay: 0.01
  lr_warmup_ratio: 0.05
  lr_decay_style: cosine   # constant, linear, cosine
```

### Muon

Muon applies Newton-Schulz orthogonalization to gradients before the update step, yielding better convergence for transformer weight matrices. Non-matrix parameters (biases, norms, embeddings) fall back to AdamW.
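The orthogonalization step can be sketched in plain Python. This uses the quintic Newton-Schulz iteration with the coefficients from the public Muon implementation; whether xorl uses these exact coefficients is an assumption:

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def newton_schulz(G, steps=5):
    """Quintic Newton-Schulz iteration X <- aX + (bA + cA^2)X with A = X X^T,
    driving every singular value of G toward ~1 (i.e. toward an orthogonal
    matrix). Coefficients follow the public Muon implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    norm = sum(x * x for row in G for x in row) ** 0.5
    X = [[x / norm for x in row] for row in G]  # Frobenius scaling bounds the spectral norm by 1
    for _ in range(steps):
        A = matmul(X, transpose(X))
        B = [[b * x + c * y for x, y in zip(r1, r2)] for r1, r2 in zip(A, matmul(A, A))]
        X = [[a * x + y for x, y in zip(r1, r2)] for r1, r2 in zip(X, matmul(B, X))]
    return X

# A "gradient" with a 100:1 singular-value spread comes out with both
# singular values close to 1 after 5 steps:
X = newton_schulz([[10.0, 0.0], [0.0, 0.1]])
```

Note the iteration only approximately orthogonalizes: after 5 steps the singular values land near 1 rather than exactly on it, which is sufficient for the optimizer update.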
```yaml
train:
  optimizer: muon
  lr: 1e-4            # base LR; non-matrix params use AdamW with this LR
  muon_lr: 2e-4       # Muon LR for matrix params
  muon_momentum: 0.95
  muon_ns_steps: 5    # Newton-Schulz iterations
```

## Checkpointing
xorl uses PyTorch Distributed Checkpointing (DCP) by default, which is sharding-aware and supports FSDP2.
```yaml
train:
  ckpt_manager: dcp
  save_steps: 500               # save every N steps (0 = off)
  save_epochs: 0.5              # save every N epochs (fractional OK)
  output_dir: outputs/my_run    # checkpoints saved here
  load_checkpoint_path: ""      # resume from this path (empty = from scratch)
  save_hf_weights: false        # save HF-format weights (expensive for large models)
```

Checkpoints are saved to `{output_dir}/checkpoints/global_step_{N}/`.
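The documented layout and the `save_steps` semantics can be expressed as a small sketch (helper names are hypothetical; the real trigger logic lives inside xorl):

```python
def checkpoint_path(output_dir: str, step: int) -> str:
    """Mirror of the documented layout: {output_dir}/checkpoints/global_step_{N}/"""
    return f"{output_dir}/checkpoints/global_step_{step}"

def should_save(step: int, save_steps: int) -> bool:
    """save_steps as documented: save every N steps, 0 disables step-based saves."""
    return save_steps > 0 and step % save_steps == 0

print(checkpoint_path("outputs/my_run", 500))
# outputs/my_run/checkpoints/global_step_500
print(should_save(1000, 500), should_save(1000, 0))  # True False
```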
## Gradient Checkpointing
```yaml
train:
  enable_gradient_checkpointing: true
  recompute_modules: [self_attn, mlp]   # selective recompute
```

For MoE models, `moe_checkpoint_method: moe_act` recomputes only activations inside expert FFNs, skipping the EP dispatch (faster than full recompute):
```yaml
train:
  moe_checkpoint_method: moe_act
```

## Logging
```yaml
train:
  log_format: structured   # key=value lines for parsing, or progress_bar (tqdm)
  use_wandb: true
  wandb_project: my_project
  wandb_name: qwen3_8b_ft
  wandb_log_interval: 10   # log every N steps
```
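The `structured` format exists to be machine-readable. A sketch of a consumer, assuming only what the docs promise (space-separated `key=value` pairs; the example field names are hypothetical):

```python
def parse_structured_line(line: str) -> dict:
    """Parse one key=value log line into a dict, converting numeric values."""
    out = {}
    for token in line.split():
        key, _, value = token.partition("=")
        try:
            out[key] = int(value)
        except ValueError:
            try:
                out[key] = float(value)
            except ValueError:
                out[key] = value  # leave non-numeric values as strings
    return out

# A hypothetical line in the documented key=value shape:
metrics = parse_structured_line("step=40 loss=1.9321 lr=9.8e-06 grad_norm=1.13")
print(metrics["loss"])  # 1.9321
```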
Section titled “Activation Offloading”For very large models on limited GPU memory:
train: enable_activation_offload: true activation_gpu_limit: 4.0 # keep up to 4 GB of activations on GPUSource
| File | Description |
|---|---|
| `src/xorl/cli/train.py` | CLI entry point (`xorl.cli.train`) |
| `src/xorl/data/data_loader.py` | DataLoader and MicroBatchCollator |
| `src/xorl/data/prepare/` | Dataset preparation and sample packing |