MoE Training

xorl supports Mixture-of-Experts (MoE) models (Qwen3-MoE, DeepSeek-V3, etc.). This section covers the full MoE training stack.


Each MoE layer replaces a dense FFN with a gating network that routes each token to top_k out of num_experts expert FFNs, then combines their outputs weighted by the gate scores.

*Figure: MoE Layer — Token Routing and Expert Dispatch.* Hidden states [T, H] pass through the gate ([H→E] projection, softmax, top-K selection, optional norm_topk_prob); each selected expert runs a SwiGLU FFN, and the outputs are combined as a weighted sum Σ gate[k] × out[k]. With EP, each GPU holds E/ep_size experts and tokens are dispatched via AllToAll or DeepEP.
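The routing step above can be sketched in a few lines of PyTorch. This is a minimal reference, not xorl's actual router implementation; the function name `route_tokens` and the toy sizes are illustrative.

```python
import torch

def route_tokens(hidden, gate_weight, top_k=2, norm_topk_prob=True):
    # hidden: [T, H], gate_weight: [H, E]
    logits = hidden @ gate_weight                      # [T, E] router logits
    probs = torch.softmax(logits, dim=-1)              # softmax over experts
    topk_probs, topk_idx = probs.topk(top_k, dim=-1)   # pick top_k experts per token
    if norm_topk_prob:
        # renormalize the selected gate scores so they sum to 1 per token
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_probs, topk_idx

T, H, E = 4, 8, 16                                     # toy sizes
hidden = torch.randn(T, H)
gate_w = torch.randn(H, E)
gates, expert_ids = route_tokens(hidden, gate_w, top_k=2)
```

Each token then visits the experts in `expert_ids`, and its outputs are summed with weights `gates`.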

xorl stores all expert weights as a single fused tensor in GKN layout: [G, K, N] where:

| Dimension | Meaning | Size |
| --- | --- | --- |
| G (Group) | Expert index — EP slices this dimension | num_experts (or num_experts / ep_size after sharding) |
| K (Key) | Input dimension of the projection | hidden_size for gate/up; intermediate_size for down |
| N (Normal) | Output dimension of the projection | intermediate_size for gate/up; hidden_size for down |

So the three expert weight tensors are:

  • gate_proj [E, H, I] — gate path: hidden → intermediate
  • up_proj [E, H, I] — up path: hidden → intermediate
  • down_proj [E, I, H] — down path: intermediate → hidden
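The fused layout and the grouped computation it serves can be illustrated with a reference loop. This is the per-expert SwiGLU that a group-GEMM kernel fuses over the G axis; it is a toy sketch (the helper name `grouped_ffn` and the sizes are illustrative, not xorl's kernel API).

```python
import torch
import torch.nn.functional as F

E, H, I = 8, 16, 32                 # num_experts, hidden_size, intermediate_size (toy)

# Fused GKN weights: [G, K, N] = [expert, input_dim, output_dim]
gate_proj = torch.randn(E, H, I)
up_proj   = torch.randn(E, H, I)
down_proj = torch.randn(E, I, H)

def grouped_ffn(tokens_per_expert, gate_proj, up_proj, down_proj):
    # Reference loop over the G axis; each iteration is one expert's SwiGLU FFN.
    outs = []
    for e, x in enumerate(tokens_per_expert):            # x: [t_e, H]
        h = F.silu(x @ gate_proj[e]) * (x @ up_proj[e])  # [t_e, I]
        outs.append(h @ down_proj[e])                    # [t_e, H]
    return outs

toy_batches = [torch.randn(3, H) for _ in range(E)]      # 3 tokens per expert
outs = grouped_ffn(toy_batches, gate_proj, up_proj, down_proj)
```

Because K×N is the trailing, contiguous block of each expert's weights, indexing `gate_proj[e]` touches one contiguous slab of memory per expert.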

Benefits of GKN:

  • Triton/Quack group GEMM kernels consume GKN natively — the outer G loop maps to the expert axis; K×N tiles stay contiguous in memory for each expert, enabling DRAM-optimal access patterns
  • EP shards along dim 0 (G) — gives each rank a contiguous [E/ep_size, K, N] block with no tensor reordering
  • FSDP2 further shards dim 1 (K) within the EP group — giving [E/ep_size, K/fsdp_size, N] at rest
  • Quantization applies to the fused block — NF4/NVFP4/Block-FP8 quantize the full [E, K, N] tensor preserving cross-expert scale statistics
  • LoRA adapters follow the same layout — lora_A [E, r, K] and lora_B [E, N, r] shard identically to the base weights
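The two sharding steps above are plain slices of the fused tensor. A minimal sketch, assuming illustrative rank indices (the real slicing happens inside xorl's EP/FSDP2 setup, not user code):

```python
import torch

E, K, N = 8, 16, 32
ep_size, fsdp_size = 4, 2
fused = torch.randn(E, K, N)            # full GKN tensor before sharding

ep_rank, fsdp_rank = 1, 0               # illustrative rank ids

# EP: contiguous slice along dim 0 (G) — no tensor reordering needed
ep_shard = fused.chunk(ep_size, dim=0)[ep_rank]         # [E/ep_size, K, N]

# FSDP2 within the EP group: further shard dim 1 (K)
at_rest = ep_shard.chunk(fsdp_size, dim=1)[fsdp_rank]   # [E/ep_size, K/fsdp_size, N]
```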

HF MoE checkpoints store experts as nn.ModuleList of nn.Linear layers. xorl automatically converts to fused GKN format during model loading — no separate preprocessing step is needed. Simply point model_path at the standard HuggingFace checkpoint.
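The shape transformation this conversion performs boils down to stacking the per-expert Linear weights. A sketch of the idea (xorl does this internally at load time; the snippet only illustrates the layout change, and the toy sizes are illustrative):

```python
import torch
import torch.nn as nn

E, H, I = 4, 8, 16

# HF-style experts: a ModuleList of per-expert Linear layers
experts = nn.ModuleList([nn.Linear(H, I, bias=False) for _ in range(E)])

# nn.Linear stores weight as [out, in], so transpose each one
# to get the fused GKN tensor [E, K, N] = [E, H, I] for the gate path.
gate_proj = torch.stack([exp.weight.detach().t() for exp in experts])
```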


| Page | What it covers |
| --- | --- |
| Router | TopKRouter algorithm, norm_topk_prob, router_fp32, freeze_router, Routing Replay (R3) |
| Expert Kernels | Backend comparison table, Triton GEMM details, torch.compile guidance, moe_act checkpointing |
| Expert Parallelism | EP mesh, AllToAll dispatch, EP+FSDP2, EP+Ring Attention, example configs |
| LoRA & QLoRA | MoEExpertsLoRA, adapter shapes, hybrid shared LoRA, QLoRA on experts |
| DeepEP | NVLink-optimized dispatch with DeepEP |

Example: Qwen3-30B-A3B (PP=2, EP=4, CP=4, 8 GPUs)

```yaml
model:
  model_path: /path/to/Qwen3-30B-A3B-merge
  attn_implementation: flash_attention_3
  moe_implementation: triton
  merge_qkv: true
  router_fp32: true
train:
  data_parallel_mode: fsdp2
  data_parallel_shard_size: 1
  pipeline_parallel_size: 2
  pipeline_parallel_schedule: 1F1B
  expert_parallel_size: 4
  ringattn_parallel_size: 4
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
  optimizer: muon
  muon_lr: 2e-4
  lr: 1e-4
  gradient_accumulation_steps: 2
  enable_mixed_precision: true
  reshard_after_forward: true
  init_device: meta
```
Example: Qwen3-235B-A22B (EP=64, Ulysses=64, DeepEP)

```yaml
model:
  model_path: Qwen/Qwen3-235B-A22B-Instruct-FP8
  attn_implementation: flash_attention_3
  moe_implementation: triton
  ep_dispatch: deepep
  router_fp32: true
train:
  expert_parallel_size: 64
  ulysses_parallel_size: 64
  data_parallel_shard_size: 1
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
  enable_activation_offload: true
  init_device: meta
  load_weights_mode: all_ranks
```