
Local Training

Local training uses torchrun to launch distributed processes directly, without an API server. It is best for offline supervised fine-tuning where the full training loop is fixed ahead of time.

torchrun \
  --nproc_per_node <GPUS_PER_NODE> \
  [--nnodes <NUM_NODES>] \
  [--node_rank <NODE_RANK>] \
  [--master_addr <HEAD_NODE_IP>] \
  [--master_port <PORT>] \
  -m xorl.cli.train <config.yaml> [--key.path value ...]
Single node, 8 GPUs:
torchrun --nproc_per_node=8 -m xorl.cli.train config.yaml

On the head node:

torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
  --master_addr=HEAD_IP --master_port=29500 \
  -m xorl.cli.train config.yaml

On each worker node:

torchrun --nproc_per_node=8 --nnodes=2 --node_rank=<RANK> \
  --master_addr=HEAD_IP --master_port=29500 \
  -m xorl.cli.train config.yaml
Data pipeline: Dataset (files.jsonl / HF) → Prepare (tokenize, packing) → DataLoader (sample_packing, seq=8192) → MicroBatchCollator (grad_accum micro-batches) → gradient accumulation (fwd + bwd × N) → optimizer step → save ckpt → next step.

Every training config has three top-level sections: model, data, and train. A lora section is added when using LoRA or QLoRA.

model:
  model_path: Qwen/Qwen3-8B # HF Hub ID or local path
  attn_implementation: flash_attention_3
  # moe_implementation: triton # for MoE models
data:
  datasets:
    - path: /data/train.jsonl
      type: tokenized
      max_seq_len: 4096
      select_columns: [input_ids, labels]
  sample_packing_method: sequential
  sample_packing_sequence_len: 4096
train:
  output_dir: outputs/my_run
  data_parallel_mode: fsdp2
  micro_batch_size: 1
  gradient_accumulation_steps: 4
  num_train_epochs: 1
  optimizer: adamw
  lr: 1e-5
  lr_warmup_ratio: 0.05
  lr_decay_style: cosine
  weight_decay: 0.01
  max_grad_norm: 1.0
  enable_mixed_precision: true
  enable_gradient_checkpointing: true
  enable_full_shard: true
  init_device: meta
  load_weights_mode: broadcast
  save_steps: 500
  ckpt_manager: dcp
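As a quick pre-launch sanity check, the three required top-level sections can be verified once the YAML is parsed. The helper below is illustrative, not part of xorl; it assumes the config has already been loaded into a dict (e.g. with PyYAML's safe_load):

```python
# Minimal structural check for a parsed training config.
# Hypothetical helper, not part of xorl's API; only top-level
# keys are checked. The optional lora section is ignored here.

REQUIRED_SECTIONS = ("model", "data", "train")

def validate_config(cfg: dict) -> list[str]:
    """Return the missing top-level sections (empty list = OK)."""
    return [s for s in REQUIRED_SECTIONS if s not in cfg]

cfg = {
    "model": {"model_path": "Qwen/Qwen3-8B"},
    "data": {"datasets": [{"path": "/data/train.jsonl"}]},
    "train": {"output_dir": "outputs/my_run", "lr": 1e-5},
}
print(validate_config(cfg))  # → []
```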
model:
  model_path: Qwen/Qwen3-8B

xorl downloads models from the HF Hub on first use (cached in ~/.cache/huggingface). Pass a local path instead to avoid re-downloading.

Meta Device (recommended for large models)
train:
  init_device: meta
  load_weights_mode: broadcast # rank 0 reads, all ranks receive via NCCL

Meta device creates model parameters as zero-cost placeholders. FSDP2 then loads and shards weights directly into each rank’s shard — no rank ever holds the full model in memory. Without meta device, every rank would briefly allocate the entire model before sharding, causing OOM for models larger than ~30B parameters on 80 GB GPUs. Use load_weights_mode: all_ranks for NVMe-fast local storage (each rank reads independently).
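The OOM claim is easy to sanity-check with back-of-envelope arithmetic. This sketch assumes bf16 weights (2 bytes per parameter) and counts only the weights themselves, ignoring CUDA context, activations, gradients, and optimizer state, so it is a lower bound:

```python
# Why non-meta init OOMs: every rank materializes the full model
# before FSDP2 shards it. bf16 assumed (2 bytes/param); everything
# besides the raw weights is ignored, so this is a lower bound.

BYTES_PER_PARAM = 2  # bf16

def full_model_gib(n_params: float) -> float:
    """GiB needed on each rank to hold the unsharded weights."""
    return n_params * BYTES_PER_PARAM / 1024**3

for n in (8e9, 30e9, 70e9):
    print(f"{n / 1e9:.0f}B params -> {full_model_gib(n):.1f} GiB per rank")

# ~30B already needs ~55.9 GiB per rank just for weights, leaving
# little headroom on an 80 GB GPU; ~70B (~130 GiB) cannot fit at all.
```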

Mode    Description                 When to use
ddp     DistributedDataParallel     Small models, quick experiments
fsdp2   FSDP2 (ZeRO-3 equivalent)   Large models, default choice
none    Single process              Debugging
train:
  data_parallel_mode: fsdp2
  enable_full_shard: true # shard params + grads + optimizer states
  data_parallel_shard_size: 8 # GPUs per shard group (default: world_size)
  data_parallel_replicate_size: 1 # data replicas (HSDP when > 1)

HSDP (Hybrid Sharding): For multi-node, shard within node and replicate across nodes:

data_parallel_shard_size: 8 # shard within 8-GPU node
data_parallel_replicate_size: 4 # 4 node replicas
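The two sizes must multiply to the world size. A quick layout check (hypothetical helper for illustration; xorl performs its own validation):

```python
# HSDP layout check: world_size must equal shard_size * replicate_size.
# Hypothetical helper, not part of xorl.

def hsdp_layout(world_size: int, shard_size: int, replicate_size: int) -> str:
    if shard_size * replicate_size != world_size:
        raise ValueError(
            f"shard_size * replicate_size = {shard_size * replicate_size} "
            f"!= world_size = {world_size}"
        )
    return f"{replicate_size} replicas x {shard_size}-way sharding"

# 4 nodes x 8 GPUs: shard within each node, replicate across nodes.
print(hsdp_layout(world_size=32, shard_size=8, replicate_size=4))
# → 4 replicas x 8-way sharding
```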

xorl packs multiple short sequences into a single training bin to maximize GPU utilization.

data:
  sample_packing_method: sequential # or multipack (better packing, slower)
  sample_packing_sequence_len: 8192 # target packed length per step

Each sample’s max_seq_len gates individual sequence truncation before packing:

datasets:
  - path: data.jsonl
    type: tokenized
    max_seq_len: 4096 # truncate individual samples to 4096 before packing
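The sequential method can be sketched as a single greedy pass over the (already truncated) samples: append each sequence to the current bin, and start a new bin whenever the next sequence would overflow the target length. This is an illustrative re-implementation, not xorl's actual code (see src/xorl/data/prepare/); multipack additionally reorders samples to waste less space.

```python
# Greedy "sequential" sample packing sketch. Illustration only;
# xorl's real implementation lives in src/xorl/data/prepare/.

def pack_sequential(seq_lens: list[int], target_len: int) -> list[list[int]]:
    bins: list[list[int]] = []
    current: list[int] = []
    used = 0
    for n in seq_lens:
        n = min(n, target_len)  # max_seq_len truncation happens upstream
        if used + n > target_len and current:
            bins.append(current)  # close the bin, start a new one
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        bins.append(current)
    return bins

print(pack_sequential([3000, 2000, 4000, 1000, 6000], target_len=8192))
# → [[3000, 2000], [4000, 1000], [6000]]
```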
train:
  optimizer: adamw
  lr: 1e-5
  weight_decay: 0.01
  lr_warmup_ratio: 0.05
  lr_decay_style: cosine # constant, linear, cosine
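The warmup-then-cosine shape these keys describe can be written out directly. This is the standard textbook formulation; xorl's exact schedule (e.g. any minimum-LR floor) may differ:

```python
import math

# Linear warmup for the first warmup_ratio of training, then cosine
# decay to zero. Standard formulation; xorl's exact schedule may differ.

def lr_at(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps           # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

total = 1000  # 5% warmup -> peak LR reached at step 49
for s in (0, 49, 500, 999):
    print(s, f"{lr_at(s, total, peak_lr=1e-5, warmup_ratio=0.05):.2e}")
```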

Muon applies Newton-Schulz orthogonalization to gradients before the update step, yielding better convergence for transformer weight matrices. Non-matrix parameters (biases, norms, embeddings) fall back to AdamW.

train:
  optimizer: muon
  lr: 1e-4 # Muon base LR (for non-matrix params, uses AdamW)
  muon_lr: 2e-4 # Muon LR for matrix params
  muon_momentum: 0.95
  muon_ns_steps: 5 # Newton-Schulz iterations
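What Newton-Schulz does to a gradient matrix can be seen in a few lines. This sketch uses the classic cubic iteration X ← 1.5X − 0.5XXᵀX on a Frobenius-normalized matrix; production Muon implementations use a tuned polynomial, but the effect — driving every singular value toward 1, i.e. orthogonalizing the update — is the same:

```python
import numpy as np

# Newton-Schulz orthogonalization sketch: after Frobenius-normalizing
# (so singular values are <= 1), the cubic iteration
#   X <- 1.5 * X - 0.5 * X @ X.T @ X
# pushes every singular value toward 1. Textbook form; the actual Muon
# optimizer uses a tuned polynomial variant.

def newton_schulz(g: np.ndarray, steps: int = 5) -> np.ndarray:
    x = g / np.linalg.norm(g)  # Frobenius norm: singular values <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

g = np.diag([0.5, 0.8, 0.3])  # a matrix with known singular values
x = newton_schulz(g, steps=5)
print(np.round(np.linalg.svd(x, compute_uv=False), 3))  # all near 1
```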

xorl uses PyTorch Distributed Checkpoint (DCP) by default, which is sharding-aware and supports FSDP2.

train:
  ckpt_manager: dcp
  save_steps: 500 # save every N steps (0 = off)
  save_epochs: 0.5 # save every N epochs (fractional OK)
  output_dir: outputs/my_run # checkpoints saved here
  load_checkpoint_path: "" # resume from this path (empty = from scratch)
  save_hf_weights: false # save HF-format weights (expensive for large models)

Checkpoints are saved to {output_dir}/checkpoints/global_step_{N}/.
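To resume a run, point load_checkpoint_path at one of those directories; the step number below is illustrative:

```yaml
train:
  output_dir: outputs/my_run
  ckpt_manager: dcp
  load_checkpoint_path: outputs/my_run/checkpoints/global_step_500
```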

train:
  enable_gradient_checkpointing: true
  recompute_modules: [self_attn, mlp] # selective recompute

For MoE models, moe_checkpoint_method: moe_act recomputes only activations inside expert FFNs, skipping the EP dispatch (faster than full recompute):

train:
  moe_checkpoint_method: moe_act
train:
  log_format: structured # key=value lines for parsing, or progress_bar (tqdm)
  use_wandb: true
  wandb_project: my_project
  wandb_name: qwen3_8b_ft
  wandb_log_interval: 10 # log every N steps
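One reason to prefer log_format: structured is that key=value lines are trivially machine-parseable. The field names in the sample line below (step, loss, lr, tokens_per_sec) are hypothetical; check your own logs for the actual keys:

```python
# Parse a key=value structured log line into a dict. The field names
# in the sample are hypothetical; xorl's actual keys may differ.

def parse_structured(line: str) -> dict[str, str]:
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

record = parse_structured("step=100 loss=1.8342 lr=9.5e-6 tokens_per_sec=12800")
print(record["loss"], record["lr"])  # → 1.8342 9.5e-6
```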

For very large models on limited GPU memory:

train:
  enable_activation_offload: true
  activation_gpu_limit: 4.0 # keep up to 4 GB of activations on GPU
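For a rough sense of what the 4 GB budget buys, consider how many full-length activation tensors it holds. The hidden size below is a hypothetical model dimension, and real per-layer footprints depend on the architecture and recompute settings:

```python
# Rough activation-budget arithmetic: how many [seq, hidden] bf16
# activation tensors fit under activation_gpu_limit = 4.0 GiB.
# hidden=4096 is a hypothetical model dimension.

seq_len, hidden, bytes_per = 8192, 4096, 2  # bf16
tensor_gib = seq_len * hidden * bytes_per / 1024**3
budget_gib = 4.0
print(f"{tensor_gib:.4f} GiB per tensor, "
      f"~{int(budget_gib // tensor_gib)} tensors kept on GPU")
```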
File                            Description
src/xorl/cli/train.py           CLI entry point (xorl.cli.train)
src/xorl/data/data_loader.py    DataLoader and MicroBatchCollator
src/xorl/data/prepare/          Dataset preparation and sample packing