MoE Training

xorl supports Mixture-of-Experts (MoE) models (Qwen3-MoE, DeepSeek-V3, etc.). This section covers the full MoE training stack.


Each MoE layer replaces a dense FFN with a gating network that routes each token to top_k out of num_experts expert FFNs, then combines their outputs weighted by the gate scores.

*Figure: MoE Layer — Token Routing and Expert Dispatch.* Hidden states [T, H] pass through the gate ([H→E] projection, softmax, top-K selection, optional norm_topk_prob); each selected expert runs a SwiGLU FFN, and the outputs are combined as a weighted sum Σ gate[k] × out[k]. With EP, each GPU holds E/ep_size experts and tokens are dispatched via AllToAll or DeepEP.
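The routing step above can be sketched in a few lines of PyTorch. This is a minimal reference, not xorl's actual router implementation; the function name `route_tokens` and the toy sizes are illustrative.

```python
import torch

def route_tokens(hidden, gate_weight, top_k=2, norm_topk_prob=True):
    # hidden: [T, H], gate_weight: [H, E]
    logits = hidden @ gate_weight                      # [T, E] router logits
    probs = torch.softmax(logits, dim=-1)              # softmax over experts
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)   # pick top_k experts per token
    if norm_topk_prob:
        # renormalize the selected gate scores so they sum to 1 per token
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_probs, topk_idx

T, H, E = 4, 8, 16                                     # toy sizes
hidden = torch.randn(T, H)
gate_w = torch.randn(H, E)
gates, expert_ids = route_tokens(hidden, gate_w, top_k=2)
```

Each token then visits the experts in `expert_ids`, and its outputs are summed with weights `gates`.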

xorl stores all expert weights as a single fused tensor in GKN layout: [G, K, N] where:

| Dimension | Meaning | Size |
| --- | --- | --- |
| G (Group) | Expert index — EP slices this dimension | num_experts (or num_experts / ep_size after sharding) |
| K (Key) | Input dimension of the projection | hidden_size for gate/up; intermediate_size for down |
| N (Normal) | Output dimension of the projection | intermediate_size for gate/up; hidden_size for down |

So the three expert weight tensors are:

  • gate_proj [E, H, I] — gate path: hidden → intermediate
  • up_proj [E, H, I] — up path: hidden → intermediate
  • down_proj [E, I, H] — down path: intermediate → hidden
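The fused layout and the grouped computation it serves can be illustrated with a reference loop. This is the per-expert SwiGLU that a group-GEMM kernel fuses over the G axis; it is a toy sketch (the helper name `grouped_ffn` and the sizes are illustrative, not xorl's kernel API).

```python
import torch
import torch.nn.functional as F

E, H, I = 8, 16, 32                 # num_experts, hidden_size, intermediate_size (toy)

# Fused GKN weights: [G, K, N] = [expert, input_dim, output_dim]
gate_proj = torch.randn(E, H, I)
up_proj   = torch.randn(E, H, I)
down_proj = torch.randn(E, I, H)

def grouped_ffn(tokens_per_expert, gate_proj, up_proj, down_proj):
    # Reference loop over the G axis; each iteration is one expert's SwiGLU FFN.
    outs = []
    for e, x in enumerate(tokens_per_expert):            # x: [t_e, H]
        h = F.silu(x @ gate_proj[e]) * (x @ up_proj[e])  # [t_e, I]
        outs.append(h @ down_proj[e])                    # [t_e, H]
    return outs

toy_batches = [torch.randn(3, H) for _ in range(E)]      # 3 tokens per expert
outs = grouped_ffn(toy_batches, gate_proj, up_proj, down_proj)
```

Because K×N is the trailing, contiguous block of each expert's weights, indexing `gate_proj[e]` touches one contiguous slab of memory per expert.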

Benefits of GKN:

  • Triton/Quack group GEMM kernels consume GKN natively — the outer G loop maps to the expert axis; K×N tiles stay contiguous in memory for each expert, enabling DRAM-optimal access patterns
  • EP shards along dim 0 (G) — gives each rank a contiguous [E/ep_size, K, N] block with no tensor reordering
  • FSDP2 further shards dim 1 (K) within the EP group — giving [E/ep_size, K/fsdp_size, N] at rest
  • Quantization applies to the fused block — NF4/NVFP4/Block-FP8 quantize the full [E, K, N] tensor preserving cross-expert scale statistics
  • LoRA adapters follow the same layout — lora_A [E, r, K] and lora_B [E, N, r] shard identically to the base weights
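The two sharding steps above are plain slices of the fused tensor. A minimal sketch, assuming illustrative rank indices (the real slicing happens inside xorl's EP/FSDP2 setup, not user code):

```python
import torch

E, K, N = 8, 16, 32
ep_size, fsdp_size = 4, 2
fused = torch.randn(E, K, N)            # full GKN tensor before sharding

ep_rank, fsdp_rank = 1, 0               # illustrative rank ids

# EP: contiguous slice along dim 0 (G) — no tensor reordering needed
ep_shard = fused.chunk(ep_size, dim=0)[ep_rank]         # [E/ep_size, K, N]

# FSDP2 within the EP group: further shard dim 1 (K)
at_rest = ep_shard.chunk(fsdp_size, dim=1)[fsdp_rank]   # [E/ep_size, K/fsdp_size, N]
```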

HF MoE checkpoints store experts as nn.ModuleList of nn.Linear layers. xorl automatically converts to fused GKN format during model loading — no separate preprocessing step is needed. Simply point model_path at the standard HuggingFace checkpoint.
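The shape transformation this conversion performs boils down to stacking the per-expert Linear weights. A sketch of the idea (xorl does this internally at load time; the snippet only illustrates the layout change, and the toy sizes are illustrative):

```python
import torch
import torch.nn as nn

E, H, I = 4, 8, 16

# HF-style experts: a ModuleList of per-expert Linear layers
experts = nn.ModuleList([nn.Linear(H, I, bias=False) for _ in range(E)])

# nn.Linear stores weight as [out, in], so transpose each one
# to get the fused GKN tensor [E, K, N] = [E, H, I] for the gate path.
gate_proj = torch.stack([exp.weight.detach().t() for exp in experts])
```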


| Page | What it covers |
| --- | --- |
| Router | TopKRouter algorithm, norm_topk_prob, router_fp32, freeze_router, Routing Replay (R3) |
| Expert Kernels | Backend comparison table, Triton GEMM details, torch.compile guidance, moe_act checkpointing |
| Expert Parallelism | EP mesh, AllToAll dispatch, EP+FSDP2, EP+Ring Attention, example configs |
| LoRA & QLoRA | MoEExpertsLoRA, adapter shapes, hybrid shared LoRA, QLoRA on experts |
| DeepEP | NVLink-optimized dispatch with DeepEP |

Example: Qwen3-30B-A3B (PP=2, EP=4, CP=4, 8 GPUs)

```yaml
model:
  model_path: /path/to/Qwen3-30B-A3B-merge
  attn_implementation: flash_attention_3
  moe_implementation: triton
  merge_qkv: true
  router_fp32: true
train:
  data_parallel_mode: fsdp2
  data_parallel_shard_size: 1
  pipeline_parallel_size: 2
  pipeline_parallel_schedule: 1F1B
  expert_parallel_size: 4
  ringattn_parallel_size: 4
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
  optimizer: muon
  muon_lr: 2e-4
  lr: 1e-4
  gradient_accumulation_steps: 2
  enable_mixed_precision: true
  reshard_after_forward: true
  init_device: meta
```
Example: Qwen3-235B-A22B (EP=64, Ulysses=64, DeepEP)

```yaml
model:
  model_path: Qwen/Qwen3-235B-A22B-Instruct-FP8
  attn_implementation: flash_attention_3
  moe_implementation: triton
  ep_dispatch: deepep
  router_fp32: true
train:
  expert_parallel_size: 64
  ulysses_parallel_size: 64
  data_parallel_shard_size: 1
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
  enable_activation_offload: true
  init_device: meta
  load_weights_mode: all_ranks
```