
Expert Parallelism

Expert Parallelism (EP) shards expert weights across expert_parallel_size GPUs. Each GPU holds num_experts / ep_size experts — a contiguous slice of the G dimension in GKN layout. Non-expert parameters (attention, norms, gate linear) are replicated across the EP group.

```yaml
train:
  expert_parallel_size: 8
model:
  ep_dispatch: alltoall  # or deepep
```
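The contiguous-slab ownership described above can be sketched as a small helper. This is illustrative only; `local_expert_range` is not part of the codebase, and it assumes `num_experts` divides evenly by `ep_size`:

```python
def local_expert_range(num_experts: int, ep_size: int, ep_rank: int) -> tuple[int, int]:
    """Half-open [start, end) slice of the G (expert) axis owned by one EP rank.

    Hypothetical helper illustrating the contiguous-slab layout; assumes
    num_experts is divisible by ep_size.
    """
    assert num_experts % ep_size == 0
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return start, start + per_rank

# e.g. 128 experts over 8 EP ranks: rank 3 owns experts [48, 64)
print(local_expert_range(128, 8, 3))  # -> (48, 64)
```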

For the full technical reference — dispatch internals, device mesh construction, gradient handling, R3 routing replay, EP composition with FSDP2/PP/CP, and constraint table — see Expert Parallelism (Parallelism section).


EP Forward: Dispatch → Compute → Combine

1. **Dispatch**: every GPU holds the full sequence; AllToAll (NCCL) or DeepEP (NVLink) routes each token to the rank that owns its target expert.
2. **Expert Compute**: each rank runs only its local experts, a group GEMM over `[E/ep_size, K, N]` weights (Triton, native, or quack backend).
3. **Combine**: AllToAll (NCCL) or async DeepEP applies the weighted sum and returns results to the originating ranks, yielding `Out[T, H]`.

The dispatch context is saved for the backward combine. R3 caches routing indices for gradient-checkpointing recompute.
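The three phases can be simulated in a single process to make the data flow concrete. This is a numpy sketch with top-1 routing for simplicity (real MoE layers route top-k); all names here are illustrative, not the framework's actual API:

```python
import numpy as np

# Single-process simulation of dispatch -> expert compute -> combine.
rng = np.random.default_rng(0)
T, H, E, ep_size = 8, 4, 4, 2              # tokens, hidden, experts, EP ranks
experts_per_rank = E // ep_size

x = rng.standard_normal((T, H))            # full sequence on every rank
expert_id = rng.integers(0, E, size=T)     # router's top-1 expert per token
gate_w = rng.random(T)                     # routing weight per token
W = rng.standard_normal((E, H, H))         # per-expert weights (stand-in for GKN slabs)

out = np.zeros_like(x)
owner = expert_id // experts_per_rank      # dispatch: which EP rank owns each token's expert
for rank in range(ep_size):
    for t in np.nonzero(owner == rank)[0]:
        y = x[t] @ W[expert_id[t]]         # expert compute: local experts only
        out[t] = gate_w[t] * y             # combine: weighted, written back to origin slot
```

In the real forward pass the two loops become communication collectives plus one batched group GEMM, but the token bookkeeping is the same.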
| Backend | Set via | Best for |
|---|---|---|
| alltoall | `ep_dispatch: alltoall` | Any GPU topology, no extra install |
| deepep | `ep_dispatch: deepep` | NVLink clusters, EP ≥ 16 (see DeepEP) |

Expert parameters use a 2D mesh [EP, FSDP]; non-expert parameters use a 1D global FSDP mesh:

With `world_size=16`, `EP=8`, `ep_fsdp_size=2`:

- Experts: `[EP=8, ep_fsdp=2]` — expert dim sharded EP-wise, hidden dim sharded FSDP-wise
- Non-experts: `[FSDP=16]` — full FSDP sharding
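Assuming a row-major `[EP, ep_fsdp]` layout (an illustrative assumption; the actual device-mesh construction lives in `parallel_state.py`), a global rank's mesh coordinates fall out of simple integer division:

```python
def expert_mesh_coords(rank: int, ep_fsdp_size: int) -> tuple[int, int]:
    # Row-major [EP, ep_fsdp] mesh: EP is the outer (slow) axis,
    # ep_fsdp the inner (fast) axis. The ordering is an assumption
    # made for illustration.
    return rank // ep_fsdp_size, rank % ep_fsdp_size

# world_size=16, EP=8, ep_fsdp=2: rank 5 sits at EP row 2, ep_fsdp column 1
print(expert_mesh_coords(5, 2))  # -> (2, 1)
```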

Expert FSDP shards along dim 1 (hidden dimension), orthogonal to EP’s dim 0 (expert axis). EP modules use set_gradient_divide_factor(ep_size) so gradients are correctly averaged across EP ranks during backward.
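One consistent reading of the divide factor, checked with the numbers from the example above: non-expert gradients are averaged over the full 16-rank FSDP mesh, while expert gradients are reduced only within their 2-rank ep_fsdp group, so an extra factor of `ep_size` restores the same effective divisor. A back-of-envelope check, not framework code:

```python
world_size, ep_size = 16, 8
ep_fsdp_size = world_size // ep_size       # 2: expert FSDP group size

# Non-expert params: FSDP averages over the full 1D mesh.
non_expert_divisor = world_size            # 16

# Expert params: the reduce-scatter spans only the ep_fsdp group, so the
# default divisor would be 2. Supplying ep_size as the gradient divide
# factor brings the effective divisor in line with the non-expert path.
expert_divisor = ep_fsdp_size * ep_size    # 2 * 8 = 16
assert expert_divisor == non_expert_divisor
```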


EP and ring attention CP can share the same GPU axis, maximizing NVLink locality:

```yaml
train:
  pipeline_parallel_size: 2
  expert_parallel_size: 4
  ringattn_parallel_size: 4
```

PP=2 with 4 GPUs per stage gives 8 GPUs total. Within each stage, the same 4 GPUs serve simultaneously as EP ranks and ring-attention ranks. Because `ep_fsdp_size = 4`, expert FSDP all-gathers stay within those same ring-attention peers.
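The rank layout above can be sketched as follows, assuming contiguous stage placement (an assumption for illustration; actual placement is decided by the device-mesh construction):

```python
# PP stages are the outer split; within a stage the same 4 ranks form
# both the EP group and the ring-attention (CP) group.
world, pp, ep = 8, 2, 4
stage_size = world // pp                   # 4 GPUs per stage
stages = [list(range(s * stage_size, (s + 1) * stage_size)) for s in range(pp)]
print(stages)  # -> [[0, 1, 2, 3], [4, 5, 6, 7]]
# EP group == CP group == a stage's ranks, so expert-FSDP all-gathers and
# ring-attention traffic both stay on the same NVLink-local peers.
```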


Qwen3-30B-A3B (PP=2, EP=4, CP=4, 8 GPUs):

```yaml
model:
  model_path: /path/to/Qwen3-30B-A3B-merge
  moe_implementation: triton
  ep_dispatch: alltoall
  router_fp32: true
train:
  pipeline_parallel_size: 2
  expert_parallel_size: 4
  ringattn_parallel_size: 4
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
```

Qwen3-235B-A22B (EP=64, 8 nodes, DeepEP):

```yaml
model:
  model_path: Qwen/Qwen3-235B-A22B-Instruct-FP8
  moe_implementation: triton
  ep_dispatch: deepep
  router_fp32: true
train:
  expert_parallel_size: 64
  ulysses_parallel_size: 64
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
  init_device: meta
  load_weights_mode: all_ranks
```

| Page | What it covers |
|---|---|
| Expert Parallelism (full reference) | AllToAll internals, DeepEP buffer lifecycle, gradient divide factor, R3 mechanics, EP+PP/CP/FSDP composition, constraints, parameter reference |
| MoE: Expert Kernels | Triton/Quack/native/eager backend details, torch.compile, moe_act checkpointing |
| MoE: Router | Routing algorithm, R3 routing replay in detail |
| DeepEP | NVLink-optimized dispatch, SM tuning, buffer sizing |
| MoE: LoRA & QLoRA | LoRA sharding on expert weights, hybrid shared LoRA |
| File | Description |
|---|---|
| `src/xorl/models/layers/moe/experts.py` | `MoEExperts` — EP forward, GKN weight sharding, dispatch/combine |
| `src/xorl/models/layers/moe/moe_block.py` | `MoEBlock` — dispatch orchestration, R3 integration |
| `src/xorl/distributed/parallel_state.py` | EP process group and device mesh construction |