# Parallelism Overview
xorl supports five orthogonal parallelism dimensions that can be freely composed. Each has its own dedicated guide:
| Dimension | Doc | Config field | When to use |
|---|---|---|---|
| Data (FSDP2) | Data Parallelism | data_parallel_mode, data_parallel_shard_size | Always; primary memory reduction tool |
| Tensor | Tensor Parallelism | tensor_parallel_size | When FSDP2 alone can’t fit the model |
| Pipeline | Pipeline Parallelism | pipeline_parallel_size | Very large models across nodes |
| Expert | Expert Parallelism | expert_parallel_size | MoE models |
| Sequence | Context Parallelism | ulysses_parallel_size, ringattn_parallel_size | Long sequences (>32K) |
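The config fields in the table compose in a single training config. As a minimal sketch (every value here, including `"fsdp2"` for `data_parallel_mode`, is an illustrative assumption, not a documented default):

```python
# Illustrative composed config using the field names from the table above.
# All values (including "fsdp2" for data_parallel_mode) are assumptions
# for this sketch, not documented defaults.
config = {
    "data_parallel_mode": "fsdp2",
    "data_parallel_shard_size": 2,   # FSDP2 sharding
    "tensor_parallel_size": 2,       # shard weight matrices within a node
    "pipeline_parallel_size": 1,
    "expert_parallel_size": 1,       # MoE only; uses a separate mesh
    "ulysses_parallel_size": 2,      # sequence parallelism for long contexts
    "ringattn_parallel_size": 1,
}

# GPUs implied by the main-mesh dimensions (EP excluded: it reuses ranks
# from the main mesh rather than adding a factor of its own).
mesh_fields = ("data_parallel_shard_size", "tensor_parallel_size",
               "pipeline_parallel_size", "ulysses_parallel_size",
               "ringattn_parallel_size")
world_size = 1
for field in mesh_fields:
    world_size *= config[field]
print(world_size)  # 8
```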
## When to Use Each Dimension

| Your situation | What to enable | Why |
|---|---|---|
| Model fits on 8 GPUs with FSDP2 | Nothing extra — FSDP2 alone | Simplest setup, lowest overhead |
| Model too large for FSDP2 alone | Add TP (tensor parallelism) | Shards weight matrices within a node |
| Sequences > 32K tokens | Add CP (Ulysses or Ring Attention) | Splits long sequences across GPUs |
| MoE model | Add EP (expert parallelism) | Distributes experts across GPUs |
| Model spans multiple nodes | Add PP (pipeline parallelism) | Splits layers across nodes, minimizes cross-node traffic |
FSDP2 alone covers the majority of use cases. Add other dimensions only when needed.
## How the dimensions compose

Each dimension slices the GPU grid along a different axis. All five can be active simultaneously — xorl builds a multi-dimensional DeviceMesh that assigns every GPU a unique coordinate across all active axes.
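The coordinate assignment can be sketched as a plain rank-to-coordinate mapping, assuming row-major ordering where the last axis varies fastest; the real logic lives in xorl's `init_parallel_state()` on top of PyTorch's DeviceMesh:

```python
import math

def mesh_coords(rank, mesh_shape):
    """Map a flat global rank to its coordinate along each mesh axis.

    Illustrative sketch only: assumes row-major ordering (last axis
    varies fastest), which is how a multi-dimensional device mesh can
    give every GPU a unique coordinate tuple.
    """
    coords = []
    for dim in reversed(mesh_shape):  # peel off the fastest-varying axis first
        coords.append(rank % dim)
        rank //= dim
    return tuple(reversed(coords))

# 16 GPUs sliced as (dp=2, pp=2, tp=4): each rank gets a unique coordinate.
shape = (2, 2, 4)
assert math.prod(shape) == 16
print(mesh_coords(7, shape))  # (0, 1, 3)
```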
## World Size Constraint

`world_size = DP_shard × DP_replicate × TP × PP × CP_ring × CP_ulysses`

EP is not part of the main device mesh — it uses a separate per-PP-stage mesh for expert dispatch. CP dimensions (Ring and Ulysses) are folded into the main mesh. In practice, EP ranks overlap with CP ranks within each PP stage (e.g. Qwen3-30B-A3B uses PP=2 with EP=4 and CP=4 sharing the same GPUs).
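The constraint is a straight product check. A minimal sketch of the arithmetic (not xorl's actual validation code) makes the EP exclusion explicit:

```python
import math

def check_world_size(world_size, dp_shard=1, dp_replicate=1, tp=1, pp=1,
                     cp_ring=1, cp_ulysses=1):
    """Check world_size == DP_shard * DP_replicate * TP * PP * CP_ring * CP_ulysses.

    Illustrative sketch of the constraint above, not xorl's validation
    code. EP is deliberately absent: it lives on a separate per-PP-stage
    mesh and reuses ranks from the main mesh.
    """
    product = math.prod((dp_shard, dp_replicate, tp, pp, cp_ring, cp_ulysses))
    if product != world_size:
        raise ValueError(f"mesh dims multiply to {product}, expected {world_size}")
    return True

# 8 GPUs: FSDP2 shard over 2, Ring Attention over 4 -> 2 * 4 = 8
check_world_size(8, dp_shard=2, cp_ring=4)
```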
## Quick Reference by Model Size

| Model | GPUs | Configuration |
|---|---|---|
| 8B dense | 8 | FSDP2 only |
| 8B dense, 128K context | 8 | FSDP2 + Ring=4 |
| 30B dense | 16 | FSDP2 + PP=2 |
| 30B MoE | 8 | PP=2 + EP=4 (folded with CP=4) |
| 70B dense | 32 | PP=4, FSDP2 HSDP |
| 235B MoE | 64 | EP=64, Ulysses=64 |
## Key Constraints

- PP: `gradient_accumulation_steps >= pipeline_parallel_size`
- EP: `num_experts % expert_parallel_size == 0`
- Ulysses: `num_attention_heads % ulysses_parallel_size == 0`
- Ring Attention: each packed document length must be divisible by `2 × ringattn_parallel_size`
- TP: `num_attention_heads % tensor_parallel_size == 0`; requires `merge_qkv: false`
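The constraints above can be collected into a single pre-flight check. This is a hypothetical helper for illustration — the config keys mirror the field names used in this guide, not xorl's actual validation API:

```python
def validate_parallel_config(cfg, num_experts, num_attention_heads, doc_lengths):
    """Collect violations of the key constraints listed above.

    Hypothetical helper; cfg keys mirror the field names in this guide.
    Missing keys default to a parallel size of 1 (dimension disabled).
    """
    errors = []
    if cfg.get("gradient_accumulation_steps", 1) < cfg.get("pipeline_parallel_size", 1):
        errors.append("PP: gradient_accumulation_steps < pipeline_parallel_size")
    if num_experts % cfg.get("expert_parallel_size", 1) != 0:
        errors.append("EP: num_experts % expert_parallel_size != 0")
    if num_attention_heads % cfg.get("ulysses_parallel_size", 1) != 0:
        errors.append("Ulysses: num_attention_heads % ulysses_parallel_size != 0")
    ring = cfg.get("ringattn_parallel_size", 1)
    if any(length % (2 * ring) != 0 for length in doc_lengths):
        errors.append("Ring: packed doc length not divisible by 2 * ringattn_parallel_size")
    if num_attention_heads % cfg.get("tensor_parallel_size", 1) != 0:
        errors.append("TP: num_attention_heads % tensor_parallel_size != 0")
    if cfg.get("tensor_parallel_size", 1) > 1 and cfg.get("merge_qkv", False):
        errors.append("TP: requires merge_qkv: false")
    return errors

# A config that satisfies every check: 32 heads, 8 experts, Ring=4.
ok = validate_parallel_config(
    {"gradient_accumulation_steps": 4, "pipeline_parallel_size": 2,
     "expert_parallel_size": 4, "ulysses_parallel_size": 4,
     "ringattn_parallel_size": 4, "tensor_parallel_size": 2,
     "merge_qkv": False},
    num_experts=8, num_attention_heads=32, doc_lengths=[4096, 8192],
)
print(ok)  # []
```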
## Source

`src/xorl/distributed/parallel_state.py` — `ParallelState` dataclass and `init_parallel_state()`, which builds the multi-dimensional DeviceMesh from config.