# Parallelism Overview
xorl supports five orthogonal parallelism dimensions that can be freely composed. Each has its own dedicated guide:
| Dimension | Doc | Config field | When to use |
|---|---|---|---|
| Data (FSDP2) | Data Parallelism | data_parallel_mode, data_parallel_shard_size | Always; primary memory reduction tool |
| Tensor | Tensor Parallelism | tensor_parallel_size | When FSDP2 alone can’t fit the model |
| Pipeline | Pipeline Parallelism | pipeline_parallel_size | Very large models across nodes |
| Expert | Expert Parallelism | expert_parallel_size | MoE models |
| Sequence | Context Parallelism | ulysses_parallel_size, ringattn_parallel_size | Long sequences (>32K) |
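The config fields in the table compose in a single training config. As a minimal sketch (every value here, including `"fsdp2"` for `data_parallel_mode`, is an illustrative assumption, not a documented default):

```python
# Illustrative composed config using the field names from the table above.
# All values (including "fsdp2" for data_parallel_mode) are assumptions
# for this sketch, not documented defaults.
config = {
    "data_parallel_mode": "fsdp2",
    "data_parallel_shard_size": 2,   # FSDP2 sharding
    "tensor_parallel_size": 2,       # shard weight matrices within a node
    "pipeline_parallel_size": 1,
    "expert_parallel_size": 1,       # MoE only; uses a separate mesh
    "ulysses_parallel_size": 2,      # sequence parallelism for long contexts
    "ringattn_parallel_size": 1,
}

# GPUs implied by the main-mesh dimensions (EP excluded: it reuses ranks
# from the main mesh rather than adding a factor of its own).
mesh_fields = ("data_parallel_shard_size", "tensor_parallel_size",
               "pipeline_parallel_size", "ulysses_parallel_size",
               "ringattn_parallel_size")
world_size = 1
for field in mesh_fields:
    world_size *= config[field]
print(world_size)  # 8
```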
## When to Use Each Dimension

| Your situation | What to enable | Why |
|---|---|---|
| Model fits on 8 GPUs with FSDP2 | Nothing extra — FSDP2 alone | Simplest setup, lowest overhead |
| Model too large for FSDP2 alone | Add TP (tensor parallelism) | Shards weight matrices within a node |
| Sequences > 32K tokens | Add CP (Ulysses or Ring Attention) | Splits long sequences across GPUs |
| MoE model | Add EP (expert parallelism) | Distributes experts across GPUs |
| Model spans multiple nodes | Add PP (pipeline parallelism) | Splits layers across nodes, minimizes cross-node traffic |
FSDP2 alone covers the majority of use cases. Add other dimensions only when needed.
## How the dimensions compose

Each dimension slices the GPU grid along a different axis. All five can be active simultaneously — xorl builds a multi-dimensional DeviceMesh that assigns every GPU a unique coordinate across all active axes.
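The coordinate assignment can be sketched as a plain rank-to-coordinate mapping, assuming row-major ordering where the last axis varies fastest; the real logic lives in xorl's `init_parallel_state()` on top of PyTorch's DeviceMesh:

```python
import math

def mesh_coords(rank, mesh_shape):
    """Map a flat global rank to its coordinate along each mesh axis.

    Illustrative sketch only: assumes row-major ordering (last axis
    varies fastest), which is how a multi-dimensional device mesh can
    give every GPU a unique coordinate tuple.
    """
    coords = []
    for dim in reversed(mesh_shape):  # peel off the fastest-varying axis first
        coords.append(rank % dim)
        rank //= dim
    return tuple(reversed(coords))

# 16 GPUs sliced as (dp=2, pp=2, tp=4): each rank gets a unique coordinate.
shape = (2, 2, 4)
assert math.prod(shape) == 16
print(mesh_coords(7, shape))  # (0, 1, 3)
```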
## World Size Constraint

`world_size = DP_shard × DP_replicate × TP × PP × CP_ring × CP_ulysses`

EP is not part of the main device mesh — it uses a separate per-PP-stage mesh for expert dispatch. CP dimensions (Ring and Ulysses) are folded into the main mesh. In practice, EP ranks overlap with CP ranks within each PP stage (e.g. Qwen3-30B-A3B uses PP=2 with EP=4 and CP=4 sharing the same GPUs).
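The constraint is a straight product check. A minimal sketch of the arithmetic (not xorl's actual validation code) makes the EP exclusion explicit:

```python
import math

def check_world_size(world_size, dp_shard=1, dp_replicate=1, tp=1, pp=1,
                     cp_ring=1, cp_ulysses=1):
    """Check world_size == DP_shard * DP_replicate * TP * PP * CP_ring * CP_ulysses.

    Illustrative sketch of the constraint above, not xorl's validation
    code. EP is deliberately absent: it lives on a separate per-PP-stage
    mesh and reuses ranks from the main mesh.
    """
    product = math.prod((dp_shard, dp_replicate, tp, pp, cp_ring, cp_ulysses))
    if product != world_size:
        raise ValueError(f"mesh dims multiply to {product}, expected {world_size}")
    return True

# 8 GPUs: FSDP2 shard over 2, Ring Attention over 4 -> 2 * 4 = 8
check_world_size(8, dp_shard=2, cp_ring=4)
```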
## Quick Reference by Model Size

| Model | GPUs | Configuration |
|---|---|---|
| 8B dense | 8 | FSDP2 only |
| 8B dense, 128K context | 8 | FSDP2 + Ring=4 |
| 30B dense | 16 | FSDP2 + PP=2 |
| 30B MoE | 8 | PP=2 + EP=4 (folded with CP=4) |
| 70B dense | 32 | PP=4, FSDP2 HSDP |
| 235B MoE | 64 | EP=64, Ulysses=64 |
## Key Constraints

- PP: `gradient_accumulation_steps >= pipeline_parallel_size`
- EP: `num_experts % expert_parallel_size == 0`
- Ulysses: `num_attention_heads % ulysses_parallel_size == 0`
- Ring Attention: each packed document length must be divisible by `2 × ringattn_parallel_size`
- TP: `num_attention_heads % tensor_parallel_size == 0`; requires `merge_qkv: false`
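The constraints above can be collected into a single pre-flight check. This is a hypothetical helper for illustration — the config keys mirror the field names used in this guide, not xorl's actual validation API:

```python
def validate_parallel_config(cfg, num_experts, num_attention_heads, doc_lengths):
    """Collect violations of the key constraints listed above.

    Hypothetical helper; cfg keys mirror the field names in this guide.
    Missing keys default to a parallel size of 1 (dimension disabled).
    """
    errors = []
    if cfg.get("gradient_accumulation_steps", 1) < cfg.get("pipeline_parallel_size", 1):
        errors.append("PP: gradient_accumulation_steps < pipeline_parallel_size")
    if num_experts % cfg.get("expert_parallel_size", 1) != 0:
        errors.append("EP: num_experts % expert_parallel_size != 0")
    if num_attention_heads % cfg.get("ulysses_parallel_size", 1) != 0:
        errors.append("Ulysses: num_attention_heads % ulysses_parallel_size != 0")
    ring = cfg.get("ringattn_parallel_size", 1)
    if any(length % (2 * ring) != 0 for length in doc_lengths):
        errors.append("Ring: packed doc length not divisible by 2 * ringattn_parallel_size")
    if num_attention_heads % cfg.get("tensor_parallel_size", 1) != 0:
        errors.append("TP: num_attention_heads % tensor_parallel_size != 0")
    if cfg.get("tensor_parallel_size", 1) > 1 and cfg.get("merge_qkv", False):
        errors.append("TP: requires merge_qkv: false")
    return errors

# A config that satisfies every check: 32 heads, 8 experts, Ring=4.
ok = validate_parallel_config(
    {"gradient_accumulation_steps": 4, "pipeline_parallel_size": 2,
     "expert_parallel_size": 4, "ulysses_parallel_size": 4,
     "ringattn_parallel_size": 4, "tensor_parallel_size": 2,
     "merge_qkv": False},
    num_experts=8, num_attention_heads=32, doc_lengths=[4096, 8192],
)
print(ok)  # []
```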
## Source

`src/xorl/distributed/parallel_state.py` — `ParallelState` dataclass and `init_parallel_state()`, which builds the multi-dimensional DeviceMesh from config.