
Local Training

Local training uses torchrun to launch distributed processes directly, without an API server. It is best for offline supervised fine-tuning where the full training loop is fixed ahead of time.

torchrun \
  --nproc_per_node <GPUS_PER_NODE> \
  [--nnodes <NUM_NODES>] \
  [--node_rank <NODE_RANK>] \
  [--master_addr <HEAD_NODE_IP>] \
  [--master_port <PORT>] \
  -m xorl.cli.train <config.yaml> [--key.path value ...]
Single node, 8 GPUs:
torchrun --nproc_per_node=8 -m xorl.cli.train config.yaml

On the head node:

torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 \
  --master_addr=HEAD_IP --master_port=29500 \
  -m xorl.cli.train config.yaml

On each worker node:

torchrun --nproc_per_node=8 --nnodes=2 --node_rank=<RANK> \
  --master_addr=HEAD_IP --master_port=29500 \
  -m xorl.cli.train config.yaml
Data pipeline: Dataset (files.jsonl / HF) → Prepare (tokenize, packing) → DataLoader (sample_packing, seq=8192) → MicroBatchCollator (grad_accum micro-batches) → gradient accumulation (fwd + bwd × N) → optimizer step → save ckpt → next step.

Every training config has three top-level sections: model, data, and train. A lora section is added when using LoRA or QLoRA.

model:
  model_path: Qwen/Qwen3-8B # HF Hub ID or local path
  attn_implementation: flash_attention_3
  # moe_implementation: triton # for MoE models
data:
  datasets:
    - path: /data/train.jsonl
      type: tokenized
      max_seq_len: 4096
      select_columns: [input_ids, labels]
  sample_packing_method: sequential
  sample_packing_sequence_len: 4096
train:
  output_dir: outputs/my_run
  data_parallel_mode: fsdp2
  micro_batch_size: 1
  gradient_accumulation_steps: 4
  num_train_epochs: 1
  optimizer: adamw
  lr: 1e-5
  lr_warmup_ratio: 0.05
  lr_decay_style: cosine
  weight_decay: 0.01
  max_grad_norm: 1.0
  enable_mixed_precision: true
  enable_gradient_checkpointing: true
  enable_full_shard: true
  init_device: meta
  load_weights_mode: broadcast
  save_steps: 500
  ckpt_manager: dcp
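As a quick pre-launch sanity check, the three required top-level sections can be verified once the YAML is parsed. The helper below is illustrative, not part of xorl; it assumes the config has already been loaded into a dict (e.g. with PyYAML's safe_load):

```python
# Minimal structural check for a parsed training config.
# Hypothetical helper, not part of xorl's API; only top-level
# keys are checked. The optional lora section is ignored here.

REQUIRED_SECTIONS = ("model", "data", "train")

def validate_config(cfg: dict) -> list[str]:
    """Return the missing top-level sections (empty list = OK)."""
    return [s for s in REQUIRED_SECTIONS if s not in cfg]

cfg = {
    "model": {"model_path": "Qwen/Qwen3-8B"},
    "data": {"datasets": [{"path": "/data/train.jsonl"}]},
    "train": {"output_dir": "outputs/my_run", "lr": 1e-5},
}
print(validate_config(cfg))  # → []
```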
model:
  model_path: Qwen/Qwen3-8B

xorl downloads models from the HF Hub on first use (cached in ~/.cache/huggingface). Pass a local path instead to avoid re-downloading.

Meta Device (recommended for large models)
train:
  init_device: meta
  load_weights_mode: broadcast # rank 0 reads, all ranks receive via NCCL

Meta device creates model parameters as zero-cost placeholders. FSDP2 then loads and shards weights directly into each rank’s shard — no rank ever holds the full model in memory. Without meta device, every rank would briefly allocate the entire model before sharding, causing OOM for models larger than ~30B parameters on 80 GB GPUs. Use load_weights_mode: all_ranks for NVMe-fast local storage (each rank reads independently).
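The OOM claim is easy to sanity-check with back-of-envelope arithmetic. This sketch assumes bf16 weights (2 bytes per parameter) and counts only the weights themselves, ignoring CUDA context, activations, gradients, and optimizer state, so it is a lower bound:

```python
# Why non-meta init OOMs: every rank materializes the full model
# before FSDP2 shards it. bf16 assumed (2 bytes/param); everything
# besides the raw weights is ignored, so this is a lower bound.

BYTES_PER_PARAM = 2  # bf16

def full_model_gib(n_params: float) -> float:
    """GiB needed on each rank to hold the unsharded weights."""
    return n_params * BYTES_PER_PARAM / 1024**3

for n in (8e9, 30e9, 70e9):
    print(f"{n / 1e9:.0f}B params -> {full_model_gib(n):.1f} GiB per rank")

# ~30B already needs ~55.9 GiB per rank just for weights, leaving
# little headroom on an 80 GB GPU; ~70B (~130 GiB) cannot fit at all.
```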

Mode    Description                 When to use
ddp     DistributedDataParallel     Small models, quick experiments
fsdp2   FSDP2 (ZeRO-3 equivalent)   Large models, default choice
none    Single process              Debugging
train:
  data_parallel_mode: fsdp2
  enable_full_shard: true # shard params + grads + optimizer states
  data_parallel_shard_size: 8 # GPUs per shard group (default: world_size)
  data_parallel_replicate_size: 1 # data replicas (HSDP when > 1)

HSDP (Hybrid Sharding): For multi-node, shard within node and replicate across nodes:

data_parallel_shard_size: 8 # shard within 8-GPU node
data_parallel_replicate_size: 4 # 4 node replicas
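The two sizes must multiply to the world size. A quick layout check (hypothetical helper for illustration; xorl performs its own validation):

```python
# HSDP layout check: world_size must equal shard_size * replicate_size.
# Hypothetical helper, not part of xorl.

def hsdp_layout(world_size: int, shard_size: int, replicate_size: int) -> str:
    if shard_size * replicate_size != world_size:
        raise ValueError(
            f"shard_size * replicate_size = {shard_size * replicate_size} "
            f"!= world_size = {world_size}"
        )
    return f"{replicate_size} replicas x {shard_size}-way sharding"

# 4 nodes x 8 GPUs: shard within each node, replicate across nodes.
print(hsdp_layout(world_size=32, shard_size=8, replicate_size=4))
# → 4 replicas x 8-way sharding
```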

xorl packs multiple short sequences into a single training bin to maximize GPU utilization.

data:
  sample_packing_method: sequential # or multipack (better packing, slower)
  sample_packing_sequence_len: 8192 # target packed length per step

Each sample’s max_seq_len gates individual sequence truncation before packing:

datasets:
  - path: data.jsonl
    type: tokenized
    max_seq_len: 4096 # truncate individual samples to 4096 before packing
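The sequential method can be sketched as a single greedy pass over the (already truncated) samples: append each sequence to the current bin, and start a new bin whenever the next sequence would overflow the target length. This is an illustrative re-implementation, not xorl's actual code (see src/xorl/data/prepare/); multipack additionally reorders samples to waste less space.

```python
# Greedy "sequential" sample packing sketch. Illustration only;
# xorl's real implementation lives in src/xorl/data/prepare/.

def pack_sequential(seq_lens: list[int], target_len: int) -> list[list[int]]:
    bins: list[list[int]] = []
    current: list[int] = []
    used = 0
    for n in seq_lens:
        n = min(n, target_len)  # max_seq_len truncation happens upstream
        if used + n > target_len and current:
            bins.append(current)  # close the bin, start a new one
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        bins.append(current)
    return bins

print(pack_sequential([3000, 2000, 4000, 1000, 6000], target_len=8192))
# → [[3000, 2000], [4000, 1000], [6000]]
```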
train:
  optimizer: adamw
  lr: 1e-5
  weight_decay: 0.01
  lr_warmup_ratio: 0.05
  lr_decay_style: cosine # constant, linear, cosine
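The warmup-then-cosine shape these keys describe can be written out directly. This is the standard textbook formulation; xorl's exact schedule (e.g. any minimum-LR floor) may differ:

```python
import math

# Linear warmup for the first warmup_ratio of training, then cosine
# decay to zero. Standard formulation; xorl's exact schedule may differ.

def lr_at(step: int, total_steps: int, peak_lr: float, warmup_ratio: float) -> float:
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps           # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

total = 1000  # 5% warmup -> peak LR reached at step 49
for s in (0, 49, 500, 999):
    print(s, f"{lr_at(s, total, peak_lr=1e-5, warmup_ratio=0.05):.2e}")
```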

Muon applies Newton-Schulz orthogonalization to gradients before the update step, yielding better convergence for transformer weight matrices. Non-matrix parameters (biases, norms, embeddings) fall back to AdamW.

train:
  optimizer: muon
  lr: 1e-4 # Muon base LR (for non-matrix params, uses AdamW)
  muon_lr: 2e-4 # Muon LR for matrix params
  muon_momentum: 0.95
  muon_ns_steps: 5 # Newton-Schulz iterations
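What Newton-Schulz does to a gradient matrix can be seen in a few lines. This sketch uses the classic cubic iteration X ← 1.5X − 0.5XXᵀX on a Frobenius-normalized matrix; production Muon implementations use a tuned polynomial, but the effect — driving every singular value toward 1, i.e. orthogonalizing the update — is the same:

```python
import numpy as np

# Newton-Schulz orthogonalization sketch: after Frobenius-normalizing
# (so singular values are <= 1), the cubic iteration
#   X <- 1.5 * X - 0.5 * X @ X.T @ X
# pushes every singular value toward 1. Textbook form; the actual Muon
# optimizer uses a tuned polynomial variant.

def newton_schulz(g: np.ndarray, steps: int = 5) -> np.ndarray:
    x = g / np.linalg.norm(g)  # Frobenius norm: singular values <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

g = np.diag([0.5, 0.8, 0.3])  # a matrix with known singular values
x = newton_schulz(g, steps=5)
print(np.round(np.linalg.svd(x, compute_uv=False), 3))  # all near 1
```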

xorl uses PyTorch Distributed Checkpoint (DCP) by default, which is sharding-aware and supports FSDP2.

train:
  ckpt_manager: dcp
  save_steps: 500 # save every N steps (0 = off)
  save_epochs: 0.5 # save every N epochs (fractional OK)
  output_dir: outputs/my_run # checkpoints saved here
  load_checkpoint_path: "" # resume from this path (empty = from scratch)
  save_hf_weights: false # save HF-format weights (expensive for large models)

Checkpoints are saved to {output_dir}/checkpoints/global_step_{N}/.
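To resume a run, point load_checkpoint_path at one of those directories; the step number below is illustrative:

```yaml
train:
  output_dir: outputs/my_run
  ckpt_manager: dcp
  load_checkpoint_path: outputs/my_run/checkpoints/global_step_500
```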

train:
  enable_gradient_checkpointing: true
  recompute_modules: [self_attn, mlp] # selective recompute

For MoE models, moe_checkpoint_method: moe_act recomputes only activations inside expert FFNs, skipping the EP dispatch (faster than full recompute):

train:
  moe_checkpoint_method: moe_act
train:
  log_format: structured # key=value lines for parsing, or progress_bar (tqdm)
  use_wandb: true
  wandb_project: my_project
  wandb_name: qwen3_8b_ft
  wandb_log_interval: 10 # log every N steps
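One reason to prefer log_format: structured is that key=value lines are trivially machine-parseable. The field names in the sample line below (step, loss, lr, tokens_per_sec) are hypothetical; check your own logs for the actual keys:

```python
# Parse a key=value structured log line into a dict. The field names
# in the sample are hypothetical; xorl's actual keys may differ.

def parse_structured(line: str) -> dict[str, str]:
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)

record = parse_structured("step=100 loss=1.8342 lr=9.5e-6 tokens_per_sec=12800")
print(record["loss"], record["lr"])  # → 1.8342 9.5e-6
```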

For very large models on limited GPU memory:

train:
  enable_activation_offload: true
  activation_gpu_limit: 4.0 # keep up to 4 GB of activations on GPU
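For a rough sense of what the 4 GB budget buys, consider how many full-length activation tensors it holds. The hidden size below is a hypothetical model dimension, and real per-layer footprints depend on the architecture and recompute settings:

```python
# Rough activation-budget arithmetic: how many [seq, hidden] bf16
# activation tensors fit under activation_gpu_limit = 4.0 GiB.
# hidden=4096 is a hypothetical model dimension.

seq_len, hidden, bytes_per = 8192, 4096, 2  # bf16
tensor_gib = seq_len * hidden * bytes_per / 1024**3
budget_gib = 4.0
print(f"{tensor_gib:.4f} GiB per tensor, "
      f"~{int(budget_gib // tensor_gib)} tensors kept on GPU")
```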
File                            Description
src/xorl/cli/train.py           CLI entry point (xorl.cli.train)
src/xorl/data/data_loader.py    DataLoader and MicroBatchCollator
src/xorl/data/prepare/          Dataset preparation and sample packing