# MoE Training
xorl supports Mixture-of-Experts (MoE) models (Qwen3-MoE, DeepSeek-V3, etc.). This section covers the full MoE training stack.
## Architecture Overview

Each MoE layer replaces a dense FFN with a gating network that routes each token to `top_k` of `num_experts` expert FFNs, then combines the selected experts' outputs weighted by their gate scores.
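The routing-and-combine step can be sketched in plain PyTorch. This is a minimal illustration with made-up sizes, not xorl's actual module; the SwiGLU expert FFN and the GKN-shaped weight tensors match the layout described below.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes, not a real config
num_experts, top_k, hidden, inter = 8, 2, 64, 128
x = torch.randn(10, hidden)                       # [tokens, hidden]
gate_w = torch.randn(hidden, num_experts)         # router weights
# Fused expert weights in GKN layout: [G, K, N]
gate_proj = torch.randn(num_experts, hidden, inter)
up_proj = torch.randn(num_experts, hidden, inter)
down_proj = torch.randn(num_experts, inter, hidden)

logits = x @ gate_w                               # [tokens, num_experts]
scores, idx = torch.topk(logits.softmax(-1), top_k, dim=-1)  # top_k per token

out = torch.zeros_like(x)
for e in range(num_experts):
    tok, slot = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
    if tok.numel() == 0:
        continue
    h = F.silu(x[tok] @ gate_proj[e]) * (x[tok] @ up_proj[e])  # SwiGLU FFN
    out[tok] += scores[tok, slot, None] * (h @ down_proj[e])   # gate-weighted combine
```

In practice the per-expert loop is replaced by a grouped GEMM over the `G` axis, but the dataflow is the same.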
## Weight Layout: GKN Format

xorl stores all expert weights as a single fused tensor in GKN layout, `[G, K, N]`, where:
| Dimension | Meaning | Size |
|---|---|---|
| G (Group) | Expert index — EP slices this dimension | num_experts (or num_experts / ep_size after sharding) |
| K (Key) | Input dimension of the projection | hidden_size for gate/up; intermediate_size for down |
| N (Normal) | Output dimension of the projection | intermediate_size for gate/up; hidden_size for down |
So the three expert weight tensors are:
- `gate_proj [E, H, I]` — gate path: hidden → intermediate
- `up_proj [E, H, I]` — up path: hidden → intermediate
- `down_proj [E, I, H]` — down path: intermediate → hidden

Benefits of GKN:

- Triton/Quack group GEMM kernels consume GKN natively — the outer `G` loop maps to the expert axis; `K×N` tiles stay contiguous in memory for each expert, enabling DRAM-optimal access patterns
- EP shards along dim 0 (`G`) — gives each rank a contiguous `[E/ep_size, K, N]` block with no tensor reordering
- FSDP2 further shards dim 1 (`K`) within the EP group — giving `[E/ep_size, K/fsdp_size, N]` at rest
- Quantization applies to the fused block — NF4/NVFP4/Block-FP8 quantize the full `[E, K, N]` tensor, preserving cross-expert scale statistics
- LoRA adapters follow the same layout — `lora_A [E, r, K]` and `lora_B [E, N, r]` shard identically to the base weights
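The two-level sharding of the fused tensor can be sketched with plain `chunk` calls. This is a shape-level illustration with assumed sizes and ranks, not the actual FSDP2/EP sharding code:

```python
import torch

# Assumed sizes for illustration
E, K, N = 8, 64, 128
ep_size, fsdp_size = 4, 2
w = torch.randn(E, K, N)  # fused expert weights in GKN layout

# EP shards dim 0 (G): each EP rank gets a contiguous block of experts
ep_rank = 1
ep_shard = w.chunk(ep_size, dim=0)[ep_rank]            # [E/ep_size, K, N]
assert ep_shard.is_contiguous()                        # no reordering needed

# FSDP2 further shards dim 1 (K) within the EP group
fsdp_rank = 0
at_rest = ep_shard.chunk(fsdp_size, dim=1)[fsdp_rank]  # [E/ep_size, K/fsdp_size, N]
```

Because EP slices the leading dimension of a contiguous tensor, each rank's shard is itself contiguous, which is what keeps the group GEMM's `K×N` tiles DRAM-friendly after sharding.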
## Checkpoint Conversion

HF MoE checkpoints store experts as an nn.ModuleList of nn.Linear layers. xorl automatically converts them to the fused GKN format during model loading — no separate preprocessing step is needed. Simply point model_path at the standard HuggingFace checkpoint.
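Conceptually, the conversion stacks the per-expert weights along a new expert dimension and transposes each `nn.Linear` weight from `[out, in]` to `[in, out]` to match `[K, N]`. The snippet below is a minimal sketch of that idea, not xorl's actual loading code:

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration
E, hidden, inter = 4, 32, 64
# HF-style layout: one nn.Linear per expert, weight shape [out, in] = [I, H]
hf_experts = nn.ModuleList(
    nn.Linear(hidden, inter, bias=False) for _ in range(E)
)

# Fuse into GKN: stack on a new expert dim, transpose [I, H] -> [H, I]
fused = torch.stack([m.weight.detach().t() for m in hf_experts])  # [E, H, I]
```

The same pattern applies to `up_proj` and, with `[H, I]` swapped for `[I, H]`, to `down_proj`.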
## In This Section

| Page | What it covers |
|---|---|
| Router | TopKRouter algorithm, norm_topk_prob, router_fp32, freeze_router, Routing Replay (R3) |
| Expert Kernels | Backend comparison table, Triton GEMM details, torch.compile guidance, moe_act checkpointing |
| Expert Parallelism | EP mesh, AllToAll dispatch, EP+FSDP2, EP+Ring Attention, example configs |
| LoRA & QLoRA | MoEExpertsLoRA, adapter shapes, hybrid shared LoRA, QLoRA on experts |
| DeepEP | NVLink-optimized dispatch with DeepEP |
## Example: Qwen3-30B-A3B (PP=2, EP=4, CP=4, 8 GPUs)

```yaml
model:
  model_path: /path/to/Qwen3-30B-A3B-merge
  attn_implementation: flash_attention_3
  moe_implementation: triton
  merge_qkv: true
  router_fp32: true

train:
  data_parallel_mode: fsdp2
  data_parallel_shard_size: 1
  pipeline_parallel_size: 2
  pipeline_parallel_schedule: 1F1B
  expert_parallel_size: 4
  ringattn_parallel_size: 4
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
  optimizer: muon
  muon_lr: 2e-4
  lr: 1e-4
  gradient_accumulation_steps: 2
  enable_mixed_precision: true
  reshard_after_forward: true
  init_device: meta
```

## Example: Qwen3-235B-A22B (EP=64, 8 nodes)
```yaml
model:
  model_path: Qwen/Qwen3-235B-A22B-Instruct-FP8
  attn_implementation: flash_attention_3
  moe_implementation: triton
  ep_dispatch: deepep
  router_fp32: true

train:
  expert_parallel_size: 64
  ulysses_parallel_size: 64
  data_parallel_shard_size: 1
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
  enable_activation_offload: true
  init_device: meta
  load_weights_mode: all_ranks
```