# Expert Parallelism
Expert Parallelism (EP) shards expert weights across `expert_parallel_size` GPUs. Each GPU holds `num_experts / ep_size` experts — a contiguous slice of the G dimension in GKN layout. Non-expert parameters (attention, norms, gate linear) are replicated across the EP group.
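The per-rank slice arithmetic can be sketched as follows. `expert_slice` is an illustrative helper, not part of the xorl API:

```python
def expert_slice(ep_rank: int, ep_size: int, num_experts: int) -> range:
    """Contiguous slice of the G (expert) dimension owned by one EP rank."""
    assert num_experts % ep_size == 0, "num_experts must divide ep_size evenly"
    per_rank = num_experts // ep_size
    return range(ep_rank * per_rank, (ep_rank + 1) * per_rank)

# e.g. 128 experts over EP=8: rank 3 owns experts 48..63
print(expert_slice(3, 8, 128))  # → range(48, 64)
```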
```yaml
train:
  expert_parallel_size: 8
model:
  ep_dispatch: alltoall  # or deepep
```

For the full technical reference — dispatch internals, device mesh construction, gradient handling, R3 routing replay, EP composition with FSDP2/PP/CP, and the constraint table — see Expert Parallelism (Parallelism section).
## EP Forward Flow
## Dispatch Backends
| Backend | Set via | Best for |
|---|---|---|
| `alltoall` | `ep_dispatch: alltoall` | Any GPU topology; no extra install |
| `deepep` | `ep_dispatch: deepep` | NVLink clusters, EP ≥ 16 — see DeepEP |
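To make the dispatch step concrete, here is a single-process sketch of the send plan a rank builds before the collective. `dispatch_plan` is a hypothetical helper, not the backend's actual code — the real `alltoall` backend exchanges token buffers across ranks with a distributed all-to-all:

```python
def dispatch_plan(expert_ids: list[int], ep_size: int, num_experts: int) -> list[list[int]]:
    """Map each routed token to the EP rank owning its expert.
    With contiguous GKN slices, expert e lives on rank e // (num_experts // ep_size)."""
    per_rank = num_experts // ep_size
    sends: list[list[int]] = [[] for _ in range(ep_size)]
    for tok, e in enumerate(expert_ids):
        sends[e // per_rank].append(tok)
    return sends  # sends[r] = token indices this rank sends to rank r

# tokens routed to experts [0, 5, 12, 7], 16 experts over EP=4 (4 experts/rank)
print(dispatch_plan([0, 5, 12, 7], 4, 16))  # → [[0], [1, 3], [], [2]]
```

The combine step is the inverse permutation: expert outputs return to their source ranks and are scattered back to the original token positions.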
## EP + FSDP2 Mesh

Expert parameters use a 2D mesh `[EP, FSDP]`; non-expert parameters use a 1D global FSDP mesh:
With `world_size=16`, EP=8, `ep_fsdp_size=2`:

- Experts: `[EP=8, ep_fsdp=2]` — expert dim sharded EP-wise, hidden dim sharded FSDP-wise
- Non-experts: `[FSDP=16]` — full FSDP sharding

Expert FSDP shards along dim 1 (the hidden dimension), orthogonal to EP's dim 0 (the expert axis). EP modules use `set_gradient_divide_factor(ep_size)` so gradients are correctly averaged across EP ranks during backward.
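A minimal sketch of the 2D expert-mesh layout, assuming global ranks are assigned row-major as in PyTorch's `init_device_mesh`. `ep_mesh` is illustrative, not the library's constructor:

```python
def ep_mesh(world_size: int, ep_size: int) -> list[list[int]]:
    """Global ranks arranged as a 2D [EP, ep_fsdp] mesh, row-major.
    Row e is the FSDP group sharding rank-group e's experts along the
    hidden dim; column f is an EP group holding one full expert set."""
    ep_fsdp = world_size // ep_size
    return [[e * ep_fsdp + f for f in range(ep_fsdp)] for e in range(ep_size)]

# world_size=16, EP=8 → ep_fsdp=2, matching the example above
mesh = ep_mesh(16, 8)
print(mesh[0])               # FSDP peers for the first expert slice → [0, 1]
print([row[0] for row in mesh])  # one EP group → [0, 2, 4, 6, 8, 10, 12, 14]
```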
## EP + Ring Attention (Folded Axes)

EP and ring-attention CP can share the same GPU axis, maximizing NVLink locality:

```yaml
train:
  pipeline_parallel_size: 2
  expert_parallel_size: 4
  ringattn_parallel_size: 4
```

PP=2 × 4 GPUs/stage = 8 GPUs. Within each stage, the same 4 GPUs serve as both EP ranks and ring-attention ranks. `ep_fsdp_size = 4`, so expert FSDP all-gathers happen within the same ring-attention peers.
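Under the folded layout, group membership can be sketched as below, assuming stage-major global rank ordering. `folded_groups` is an illustrative helper, not xorl's mesh builder:

```python
def folded_groups(world_size: int, pp_size: int, ep_size: int) -> list[list[int]]:
    """GPU groups when EP and ring-attention CP fold onto the same axis:
    each PP stage's GPUs form one group that serves both roles."""
    per_stage = world_size // pp_size
    assert ep_size == per_stage, "folded axis: EP must span one stage exactly"
    return [list(range(s * per_stage, (s + 1) * per_stage)) for s in range(pp_size)]

# PP=2, EP=CP=4 on 8 GPUs, as in the config above
print(folded_groups(8, 2, 4))  # → [[0, 1, 2, 3], [4, 5, 6, 7]]
```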
## Configuration Examples

Qwen3-30B-A3B (PP=2, EP=4, CP=4, 8 GPUs):
```yaml
model:
  model_path: /path/to/Qwen3-30B-A3B-merge
  moe_implementation: triton
  ep_dispatch: alltoall
  router_fp32: true
train:
  pipeline_parallel_size: 2
  expert_parallel_size: 4
  ringattn_parallel_size: 4
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
```

Qwen3-235B-A22B (EP=64, 8 nodes, DeepEP):
```yaml
model:
  model_path: Qwen/Qwen3-235B-A22B-Instruct-FP8
  moe_implementation: triton
  ep_dispatch: deepep
  router_fp32: true
train:
  expert_parallel_size: 64
  ulysses_parallel_size: 64
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
  init_device: meta
  load_weights_mode: all_ranks
```

## Related Pages
| Page | What it covers |
|---|---|
| Expert Parallelism (full reference) | AllToAll internals, DeepEP buffer lifecycle, gradient divide factor, R3 mechanics, EP+PP/CP/FSDP composition, constraints, parameter reference |
| MoE: Expert Kernels | Triton/Quack/native/eager backend details, torch.compile, moe_act checkpointing |
| MoE: Router | Routing algorithm, R3 routing replay in detail |
| DeepEP | NVLink-optimized dispatch, SM tuning, buffer sizing |
| MoE: LoRA & QLoRA | LoRA sharding on expert weights, hybrid shared LoRA |
## Source

| File | Description |
|---|---|
| `src/xorl/models/layers/moe/experts.py` | `MoEExperts` — EP forward, GKN weight sharding, dispatch/combine |
| `src/xorl/models/layers/moe/moe_block.py` | `MoEBlock` — dispatch orchestration, R3 integration |
| `src/xorl/distributed/parallel_state.py` | EP process group and device mesh construction |