
Parallelism Overview

xorl supports five orthogonal parallelism dimensions that can be freely composed. Each has its own dedicated guide:

| Dimension | Doc | Config field | When to use |
|---|---|---|---|
| Data (FSDP2) | Data Parallelism | data_parallel_mode, data_parallel_shard_size | Always; primary memory-reduction tool |
| Tensor | Tensor Parallelism | tensor_parallel_size | When FSDP2 alone can't fit the model |
| Pipeline | Pipeline Parallelism | pipeline_parallel_size | Very large models across nodes |
| Expert | Expert Parallelism | expert_parallel_size | MoE models |
| Sequence | Context Parallelism | ulysses_parallel_size, ringattn_parallel_size | Long sequences (>32K) |
| Your situation | What to enable | Why |
|---|---|---|
| Model fits on 8 GPUs with FSDP2 | Nothing extra (FSDP2 alone) | Simplest setup, lowest overhead |
| Model too large for FSDP2 alone | Add TP (tensor parallelism) | Shards weight matrices within a node |
| Sequences > 32K tokens | Add CP (Ulysses or Ring Attention) | Splits long sequences across GPUs |
| MoE model | Add EP (expert parallelism) | Distributes experts across GPUs |
| Model spans multiple nodes | Add PP (pipeline parallelism) | Splits layers across nodes, minimizing cross-node traffic |

FSDP2 alone covers the majority of use cases. Add other dimensions only when needed.

Each dimension slices the GPU grid along a different axis. All five can be active simultaneously — xorl builds a multi-dimensional DeviceMesh that assigns every GPU a unique coordinate across all active axes.
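To make the "unique coordinate" idea concrete, here is a minimal pure-Python sketch of how a flat global rank maps to one coordinate per mesh axis. This is not xorl's actual implementation (in practice PyTorch's `DeviceMesh` handles this); it only illustrates the row-major layout, where the first axis varies slowest and the last fastest:

```python
from math import prod

def rank_to_coords(rank: int, axis_sizes: dict) -> dict:
    """Map a flat global rank to a coordinate on each mesh axis.

    Row-major layout: the first axis in `axis_sizes` varies slowest,
    the last varies fastest.
    """
    assert 0 <= rank < prod(axis_sizes.values()), "rank outside the mesh"
    coords = {}
    for name, size in reversed(axis_sizes.items()):
        rank, coords[name] = divmod(rank, size)
    # Re-emit coordinates in the original axis order.
    return {name: coords[name] for name in axis_sizes}

# Example: an 8-GPU grid sliced along three axes of size 2.
axes = {"dp_shard": 2, "tp": 2, "pp": 2}
print(rank_to_coords(5, axes))  # {'dp_shard': 1, 'tp': 0, 'pp': 1}
```

Every rank gets a distinct coordinate tuple, which is exactly what lets all axes be active simultaneously: each collective (FSDP all-gather, TP all-reduce, PP send/recv) runs only along its own axis.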

[Figure: 5 Parallelism Dimensions — Composition View. PP splits the model into stages (e.g. stage 0 holds layers 0–L/2, stage 1 holds layers L/2–L); within each stage, DP replicas each hold FSDP shards. TP shards weight matrices within a GPU group, CP (Ulysses/Ring) splits the sequence across GPUs, and EP shards MoE experts across GPUs on a separate device mesh.]
world_size = DP_shard × DP_replicate × TP × PP × CP_ring × CP_ulysses

EP is not part of the main device mesh — it uses a separate per-PP-stage mesh for expert dispatch. CP dimensions (Ring and Ulysses) are folded into the main mesh. In practice, EP ranks overlap with CP ranks within each PP stage (e.g. Qwen3-30B-A3B uses PP=2 with EP=4 and CP=4 sharing the same GPUs).

Example configurations:

| Model | GPUs | Configuration |
|---|---|---|
| 8B dense | 8 | FSDP2 only |
| 8B dense, 128K context | 8 | FSDP2 + Ring=4 |
| 30B dense | 16 | FSDP2 + PP=2 |
| 30B MoE | 8 | PP=2, EP=4, CP=4 (EP overlaps CP) |
| 70B dense | 32 | PP=4 + FSDP2 (HSDP) |
| 235B MoE | 64 | EP=64, Ulysses=64 |
Each dimension imposes a divisibility constraint that must hold at startup:

  • PP: gradient_accumulation_steps >= pipeline_parallel_size
  • EP: num_experts % expert_parallel_size == 0
  • Ulysses: num_attention_heads % ulysses_parallel_size == 0
  • Ring Attention: each packed document length divisible by 2 × ringattn_parallel_size
  • TP: num_attention_heads % tensor_parallel_size == 0; requires merge_qkv: false
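The constraints above can be checked before launch. The sketch below is illustrative, not xorl's validation code; the `ParallelConfig` field names mirror the config keys in this doc, and the `merge_qkv` requirement for TP is noted but not checked since it lives in the model config:

```python
from dataclasses import dataclass

@dataclass
class ParallelConfig:
    # Field names mirror the config keys documented above.
    pipeline_parallel_size: int = 1
    expert_parallel_size: int = 1
    ulysses_parallel_size: int = 1
    ringattn_parallel_size: int = 1
    tensor_parallel_size: int = 1
    gradient_accumulation_steps: int = 1

def validate(cfg: ParallelConfig, num_attention_heads: int,
             num_experts: int, packed_doc_lens: list) -> list:
    """Return the list of violated constraints (empty means valid)."""
    errs = []
    if cfg.gradient_accumulation_steps < cfg.pipeline_parallel_size:
        errs.append("PP: gradient_accumulation_steps < pipeline_parallel_size")
    if num_experts % cfg.expert_parallel_size:
        errs.append("EP: num_experts not divisible by expert_parallel_size")
    if num_attention_heads % cfg.ulysses_parallel_size:
        errs.append("Ulysses: num_attention_heads not divisible by ulysses_parallel_size")
    if any(l % (2 * cfg.ringattn_parallel_size) for l in packed_doc_lens):
        errs.append("Ring: packed document length not divisible by 2 * ringattn_parallel_size")
    if num_attention_heads % cfg.tensor_parallel_size:
        errs.append("TP: num_attention_heads not divisible by tensor_parallel_size")
    # TP additionally requires merge_qkv: false in the model config (not checked here).
    return errs

# A valid default config passes; an EP size that doesn't divide the
# expert count is reported.
print(validate(ParallelConfig(), num_attention_heads=32,
               num_experts=8, packed_doc_lens=[4096]))  # []
```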

See src/xorl/distributed/parallel_state.py for the ParallelState dataclass and init_parallel_state(), which builds the multi-dimensional DeviceMesh from config.
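As a rough sketch of what that initialization has to do, the stand-in below assembles the mesh axes in the order of the world_size formula, excluding EP (which lives on a separate per-PP-stage mesh). The field and config-key names here are illustrative — in particular, data_parallel_replicate_size is an assumed key, as this doc only names data_parallel_mode and data_parallel_shard_size — and the real ParallelState holds an actual DeviceMesh rather than a shape tuple:

```python
from dataclasses import dataclass
from math import prod

@dataclass(frozen=True)
class ParallelState:
    """Illustrative stand-in for xorl's ParallelState; real fields differ."""
    mesh_shape: tuple
    mesh_dim_names: tuple

def init_parallel_state(cfg: dict, world_size: int) -> ParallelState:
    # Axis order mirrors the world_size formula above. EP is excluded:
    # it uses a separate per-PP-stage mesh for expert dispatch.
    dims = {
        "dp_shard": cfg.get("data_parallel_shard_size", 1),
        "dp_replicate": cfg.get("data_parallel_replicate_size", 1),  # assumed key
        "tp": cfg.get("tensor_parallel_size", 1),
        "pp": cfg.get("pipeline_parallel_size", 1),
        "cp_ring": cfg.get("ringattn_parallel_size", 1),
        "cp_ulysses": cfg.get("ulysses_parallel_size", 1),
    }
    if prod(dims.values()) != world_size:
        raise ValueError(f"axis sizes {dims} do not multiply to world_size={world_size}")
    return ParallelState(tuple(dims.values()), tuple(dims))

state = init_parallel_state(
    {"tensor_parallel_size": 2, "pipeline_parallel_size": 2}, world_size=4)
print(state.mesh_shape)  # (1, 1, 2, 2, 1, 1)
```

The product check is the useful part: every launch must satisfy world_size = DP_shard × DP_replicate × TP × PP × CP_ring × CP_ulysses, so a mismatch is caught before any collective is issued.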