
Expert Parallelism

Expert Parallelism (EP) shards expert weights across expert_parallel_size GPUs. Each GPU holds num_experts / ep_size experts — a contiguous slice of the G dimension in GKN layout. Non-expert parameters (attention, norms, gate linear) are replicated across the EP group.

```yaml
train:
  expert_parallel_size: 8
model:
  ep_dispatch: alltoall  # or deepep
```
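The contiguous-slab ownership described above can be sketched as a small helper. This is illustrative only; `local_expert_range` is not part of the codebase, and it assumes `num_experts` divides evenly by `ep_size`:

```python
def local_expert_range(num_experts: int, ep_size: int, ep_rank: int) -> tuple[int, int]:
    """Half-open [start, end) slice of the G (expert) axis owned by one EP rank.

    Hypothetical helper illustrating the contiguous-slab layout; assumes
    num_experts is divisible by ep_size.
    """
    assert num_experts % ep_size == 0
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return start, start + per_rank

# e.g. 128 experts over 8 EP ranks: rank 3 owns experts [48, 64)
print(local_expert_range(128, 8, 3))  # -> (48, 64)
```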

For the full technical reference — dispatch internals, device mesh construction, gradient handling, R3 routing replay, EP composition with FSDP2/PP/CP, and constraint table — see Expert Parallelism (Parallelism section).


EP Forward: Dispatch → Compute → Combine

1. **Dispatch**: every GPU holds the full sequence; AllToAll (NCCL) or DeepEP (NVLink) routes each token to the rank that owns its target expert.
2. **Expert Compute**: each rank runs only its local experts, a group GEMM over `[E/ep_size, K, N]` weights (Triton, native, or quack backend).
3. **Combine**: AllToAll (NCCL) or async DeepEP applies the weighted sum and returns results to the originating ranks, yielding `Out[T, H]`.

The dispatch context is saved for the backward combine. R3 caches routing indices for gradient-checkpointing recompute.
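The three phases can be simulated in a single process to make the data flow concrete. This is a numpy sketch with top-1 routing for simplicity (real MoE layers route top-k); all names here are illustrative, not the framework's actual API:

```python
import numpy as np

# Single-process simulation of dispatch -> expert compute -> combine.
rng = np.random.default_rng(0)
T, H, E, ep_size = 8, 4, 4, 2              # tokens, hidden, experts, EP ranks
experts_per_rank = E // ep_size

x = rng.standard_normal((T, H))            # full sequence on every rank
expert_id = rng.integers(0, E, size=T)     # router's top-1 expert per token
gate_w = rng.random(T)                     # routing weight per token
W = rng.standard_normal((E, H, H))         # per-expert weights (stand-in for GKN slabs)

out = np.zeros_like(x)
owner = expert_id // experts_per_rank      # dispatch: which EP rank owns each token's expert
for rank in range(ep_size):
    for t in np.nonzero(owner == rank)[0]:
        y = x[t] @ W[expert_id[t]]         # expert compute: local experts only
        out[t] = gate_w[t] * y             # combine: weighted, written back to origin slot
```

In the real forward pass the two loops become communication collectives plus one batched group GEMM, but the token bookkeeping is the same.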
| Backend | Set via | Best for |
|---|---|---|
| alltoall | `ep_dispatch: alltoall` | Any GPU topology, no extra install |
| deepep | `ep_dispatch: deepep` | NVLink clusters, EP ≥ 16 (see DeepEP) |

Expert parameters use a 2D mesh [EP, FSDP]; non-expert parameters use a 1D global FSDP mesh:

With `world_size=16`, `EP=8`, `ep_fsdp_size=2`:

- Experts: `[EP=8, ep_fsdp=2]` — expert dim sharded EP-wise, hidden dim sharded FSDP-wise
- Non-experts: `[FSDP=16]` — full FSDP sharding
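Assuming a row-major `[EP, ep_fsdp]` layout (an illustrative assumption; the actual device-mesh construction lives in `parallel_state.py`), a global rank's mesh coordinates fall out of simple integer division:

```python
def expert_mesh_coords(rank: int, ep_fsdp_size: int) -> tuple[int, int]:
    # Row-major [EP, ep_fsdp] mesh: EP is the outer (slow) axis,
    # ep_fsdp the inner (fast) axis. The ordering is an assumption
    # made for illustration.
    return rank // ep_fsdp_size, rank % ep_fsdp_size

# world_size=16, EP=8, ep_fsdp=2: rank 5 sits at EP row 2, ep_fsdp column 1
print(expert_mesh_coords(5, 2))  # -> (2, 1)
```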

Expert FSDP shards along dim 1 (hidden dimension), orthogonal to EP’s dim 0 (expert axis). EP modules use set_gradient_divide_factor(ep_size) so gradients are correctly averaged across EP ranks during backward.
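One consistent reading of the divide factor, checked with the numbers from the example above: non-expert gradients are averaged over the full 16-rank FSDP mesh, while expert gradients are reduced only within their 2-rank ep_fsdp group, so an extra factor of `ep_size` restores the same effective divisor. A back-of-envelope check, not framework code:

```python
world_size, ep_size = 16, 8
ep_fsdp_size = world_size // ep_size       # 2: expert FSDP group size

# Non-expert params: FSDP averages over the full 1D mesh.
non_expert_divisor = world_size            # 16

# Expert params: the reduce-scatter spans only the ep_fsdp group, so the
# default divisor would be 2. Supplying ep_size as the gradient divide
# factor brings the effective divisor in line with the non-expert path.
expert_divisor = ep_fsdp_size * ep_size    # 2 * 8 = 16
assert expert_divisor == non_expert_divisor
```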


EP and ring attention CP can share the same GPU axis, maximizing NVLink locality:

```yaml
train:
  pipeline_parallel_size: 2
  expert_parallel_size: 4
  ringattn_parallel_size: 4
```

PP=2 with 4 GPUs per stage gives 8 GPUs total. Within each stage, the same 4 GPUs serve simultaneously as EP ranks and ring-attention ranks. Because `ep_fsdp_size = 4`, expert FSDP all-gathers stay within those same ring-attention peers.
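The rank layout above can be sketched as follows, assuming contiguous stage placement (an assumption for illustration; actual placement is decided by the device-mesh construction):

```python
# PP stages are the outer split; within a stage the same 4 ranks form
# both the EP group and the ring-attention (CP) group.
world, pp, ep = 8, 2, 4
stage_size = world // pp                   # 4 GPUs per stage
stages = [list(range(s * stage_size, (s + 1) * stage_size)) for s in range(pp)]
print(stages)  # -> [[0, 1, 2, 3], [4, 5, 6, 7]]
# EP group == CP group == a stage's ranks, so expert-FSDP all-gathers and
# ring-attention traffic both stay on the same NVLink-local peers.
```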


Qwen3-30B-A3B (PP=2, EP=4, CP=4, 8 GPUs):

```yaml
model:
  model_path: /path/to/Qwen3-30B-A3B-merge
  moe_implementation: triton
  ep_dispatch: alltoall
  router_fp32: true
train:
  pipeline_parallel_size: 2
  expert_parallel_size: 4
  ringattn_parallel_size: 4
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
```

Qwen3-235B-A22B (EP=64, 8 nodes, DeepEP):

```yaml
model:
  model_path: Qwen/Qwen3-235B-A22B-Instruct-FP8
  moe_implementation: triton
  ep_dispatch: deepep
  router_fp32: true
train:
  expert_parallel_size: 64
  ulysses_parallel_size: 64
  enable_gradient_checkpointing: true
  moe_checkpoint_method: moe_act
  init_device: meta
  load_weights_mode: all_ranks
```

| Page | What it covers |
|---|---|
| Expert Parallelism (full reference) | AllToAll internals, DeepEP buffer lifecycle, gradient divide factor, R3 mechanics, EP+PP/CP/FSDP composition, constraints, parameter reference |
| MoE: Expert Kernels | Triton/Quack/native/eager backend details, torch.compile, moe_act checkpointing |
| MoE: Router | Routing algorithm, R3 routing replay in detail |
| DeepEP | NVLink-optimized dispatch, SM tuning, buffer sizing |
| MoE: LoRA & QLoRA | LoRA sharding on expert weights, hybrid shared LoRA |
| File | Description |
|---|---|
| `src/xorl/models/layers/moe/experts.py` | `MoEExperts` — EP forward, GKN weight sharding, dispatch/combine |
| `src/xorl/models/layers/moe/moe_block.py` | `MoEBlock` — dispatch orchestration, R3 integration |
| `src/xorl/distributed/parallel_state.py` | EP process group and device mesh construction |