
DeepEP

DeepEP is an NVLink-optimized expert parallelism dispatch backend. It replaces NCCL AllToAll with direct GPU-to-GPU NVLink transfers, significantly reducing dispatch and combine latency for MoE models on NVLink clusters.

Requirements:

  • NVLink-connected GPU cluster (e.g. DGX H100 or other NVLink-connected H100 systems)
  • deep_ep wheel installed (see Installation)
  • ep_dispatch: deepep in the model config
  • For internode (multi-node) EP: the nvidia_peermem kernel module loaded and IBGDA enabled on all nodes (see below)
pip install deep_ep-*.whl # from xorl-org/xorl-wheels releases

Verify:

import deep_ep
print("DeepEP available")

For single-node EP (all GPUs on one machine), the wheel alone is sufficient. For multi-node EP, NVSHMEM handles inter-node RDMA communication and requires two additional steps on every node.

nvidia_peermem bridges the NVIDIA driver and the InfiniBand stack to enable GPUDirect RDMA. Without it, NVSHMEM cannot register GPU buffers with the IB HCAs, and DeepEP will crash at the first dispatch with SIGABRT and errors like:

WARN: device mlx5_0 cannot allocate buffer on the specified memory type. Skipping...

Load it on every node:

sudo modprobe nvidia_peermem

Verify:

lsmod | grep nvidia_peermem

To persist across reboots, add it to /etc/modules:

echo nvidia_peermem | sudo tee -a /etc/modules

IBGDA allows NVSHMEM to issue RDMA operations directly from GPU SM threads without CPU involvement. Add the following to /etc/modprobe.d/nvidia.conf on every node:

options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"

Then rebuild the initramfs and reboot:

sudo update-initramfs -u
sudo reboot

Verify after reboot:

sudo cat /proc/driver/nvidia/params | grep -E "EnableStreamMemOPs|RegistryDwords"
# EnableStreamMemOPs: 1
# RegistryDwords: "PeerMappingOverride=1;"

Important: nvidia_peermem must be reloaded after every reboot (via /etc/modules or modprobe) — the IBGDA driver settings alone do not load it automatically.
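Both verification checks above can be scripted. A minimal sketch, assuming the params file keeps its usual `Key: value` layout (the `check_ibgda_params` helper is illustrative, not part of xorl):

```python
def check_ibgda_params(params_text: str) -> bool:
    """Return True if the IBGDA-related driver settings are active.

    Expects the text of /proc/driver/nvidia/params, e.g. read via:
        params_text = open("/proc/driver/nvidia/params").read()
    """
    settings = {}
    for line in params_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            settings[key.strip()] = value.strip()
    # Both settings from /etc/modprobe.d/nvidia.conf must be present.
    stream_mem_ops = settings.get("EnableStreamMemOPs") == "1"
    peer_mapping = "PeerMappingOverride=1" in settings.get("RegistryDwords", "")
    return stream_mem_ops and peer_mapping
```

A `False` result on a node means the modprobe options did not take effect; rebuild the initramfs and reboot.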

model:
  ep_dispatch: deepep
  deepep_buffer_size_gb: 2.0   # NVLink buffer pool size per GPU (default: 2.0)
  deepep_num_sms: 20           # SMs dedicated to communication (default: 20)
  deepep_async_combine: false  # async combine overlap (experimental)
| Parameter | Default | Description |
| --- | --- | --- |
| `ep_dispatch` | `alltoall` | Set to `deepep` to enable |
| `deepep_buffer_size_gb` | `2.0` | Per-GPU NVLink buffer pool in GB. Larger = fewer chunked transfers. Rule of thumb: 2 × token_budget × hidden_dim × sizeof(bf16) |
| `deepep_num_sms` | `20` | SMs dedicated to communication kernels. Must be even. |
| `deepep_async_combine` | `false` | Overlap combine with the next layer's compute (experimental) |
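The constraints above can be checked before launch. A minimal sketch (the `validate_deepep_config` helper is hypothetical, not xorl API; it only encodes the documented rules):

```python
def validate_deepep_config(cfg: dict) -> list:
    """Collect violations of the documented DeepEP config constraints.

    `cfg` mirrors the `model:` section of the YAML config.
    """
    errors = []
    if cfg.get("ep_dispatch") not in ("alltoall", "deepep"):
        errors.append("ep_dispatch must be 'alltoall' or 'deepep'")
    if cfg.get("deepep_num_sms", 20) % 2 != 0:
        errors.append("deepep_num_sms must be even")
    if cfg.get("deepep_buffer_size_gb", 2.0) <= 0:
        errors.append("deepep_buffer_size_gb must be positive")
    return errors
```

Running it against a config dict before `torchrun` starts saves a failed multi-node launch.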

Standard AllToAll routes tokens through the CPU-managed NCCL pipeline:

GPU 0 → NCCL buffer → NVLink → NCCL buffer → GPU 1

DeepEP uses RDMA-style direct GPU memory access:

GPU 0 → NVLink (direct write to GPU 1 memory)

This eliminates staging through NCCL buffers and CPU synchronization points, reducing dispatch latency from ~5ms to ~1ms per step on 64-GPU clusters.

[Diagram: AllToAll vs. DeepEP data path. Standard AllToAll (NCCL): GPU memory → NCCL staging buffer → NVLink → NCCL recv buffer → GPU memory; 4 copies and 2 CPU sync points per transfer, blocking dispatch, ~5ms/step at EP=64 (Qwen3-235B). DeepEP (NVLink direct): GPU memory → NVLink → GPU memory via GPU-initiated RDMA with no CPU involvement; 1 copy, no CPU sync points, SM kernels manage pipelining and flow control, ~1ms/step (5× faster at EP=64).]

DeepEP dedicates a fixed number of SMs to communication kernels. This creates a direct tradeoff between communication bandwidth and compute throughput:

| `deepep_num_sms` | Communication bandwidth | Compute SMs remaining | Best for |
| --- | --- | --- | --- |
| 8 | Lower (fewer dispatch threads) | More (an H100 has 132 SMs) | Expert FFN is the bottleneck |
| 20 (default) | Balanced | Balanced | General purpose |
| 32 | Higher | Fewer | Large token budgets or high EP sizes |
| 48+ | Maximum | Fewer | Very large models, EP ≥ 64 |

deepep_num_sms must be even. Start with 20 and tune based on profiling: if XORL_DEBUG_EP=1 shows dispatch time exceeding compute time, increase SMs; if compute time dominates, decrease them.
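That tuning rule can be sketched as a simple heuristic. The helper below is illustrative (the step size and clamp bounds are assumptions, not xorl defaults):

```python
def tune_num_sms(current: int, dispatch_ms: float, compute_ms: float,
                 step: int = 4, lo: int = 8, hi: int = 48) -> int:
    """More SMs when dispatch dominates, fewer when compute dominates.

    Keeps the result even (a DeepEP requirement) and clamped to [lo, hi].
    """
    if dispatch_ms > compute_ms:
        proposed = current + step   # communication-bound: give SMs to dispatch
    elif compute_ms > dispatch_ms:
        proposed = current - step   # compute-bound: return SMs to the experts
    else:
        proposed = current
    proposed = max(lo, min(hi, proposed))
    return proposed - (proposed % 2)  # deepep_num_sms must be even
```

Feed it the per-phase timings printed by XORL_DEBUG_EP=1 and re-profile after each change.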


The NVLink buffer pool is pre-allocated per GPU at startup. If the buffer is too small for a step’s token volume, DeepEP sends in multiple chunks (increasing latency):

optimal_buffer_gb = (tokens_per_rank × hidden_dim × ep_size × 2 bytes) / (1024³)

For Qwen3-235B-A22B (hidden=7168, top_k=8, seq_len=4096, ep_size=64):

tokens_per_rank ≈ 4096 × 8 / 64 = 512 tokens
buffer = 512 × 7168 × 64 × 2 / (1024³) ≈ 0.44 GB → set deepep_buffer_size_gb: 1.0 (2× headroom)

If you see OOM during initialization, reduce deepep_buffer_size_gb. If profiling shows multiple chunked transfers per step, increase it.
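The formula and the worked example above can be checked in a few lines. A sketch (the `deepep_buffer_gb` helper is illustrative, and rounding the recommendation up to 0.5 GB increments is an assumption):

```python
import math

def deepep_buffer_gb(seq_len: int, top_k: int, ep_size: int,
                     hidden_dim: int, dtype_bytes: int = 2,
                     headroom: float = 2.0):
    """Return (optimal_gb, recommended_gb) per the formula above.

    recommended_gb applies the headroom factor and rounds up to 0.5 GB.
    """
    tokens_per_rank = seq_len * top_k / ep_size
    optimal = tokens_per_rank * hidden_dim * ep_size * dtype_bytes / (1024 ** 3)
    recommended = math.ceil(optimal * headroom * 2) / 2
    return optimal, recommended
```

For the Qwen3-235B-A22B example this yields optimal ≈ 0.44 GB and a recommended deepep_buffer_size_gb of 1.0, matching the calculation above.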


deepep_async_combine: true overlaps the combine communication (outputs flowing back from expert ranks) with the next layer’s compute:

Step N: dispatch → compute → [combine starts]
Step N+1: [combine finishes] + next layer compute (overlapped)

Benefit: Hides combine latency behind useful compute, especially when combine > dispatch (typical for large output projections).

Limitations:

  • Experimental — correctness verified on Qwen3 but not all architectures
  • Requires careful ordering of CUDA streams
  • Not compatible with pipeline parallelism (PP > 1)

On Qwen3-235B-A22B with EP=64 (8 nodes, 64 H100 NVLink GPUs):

| Config | Dispatch + combine time | Throughput |
| --- | --- | --- |
| AllToAll (NCCL) | ~5s per forward_backward | Baseline |
| DeepEP (num_sms=20) | ~1s per forward_backward | ~5× faster |
| DeepEP + async_combine | ~0.6s per forward_backward | ~8× faster |

Gains are most pronounced at high EP sizes (≥ 32) where AllToAll’s O(EP) staging bottleneck dominates.

| Scenario | Recommendation |
| --- | --- |
| EP ≤ 8, single node | AllToAll — same physical GPUs, NVLink available via NCCL |
| EP ≤ 8, multi-node InfiniBand | AllToAll — IB already uses RDMA |
| EP ≥ 16, NVLink cluster | DeepEP |
| EP ≥ 32, NVLink cluster | DeepEP strongly recommended |
| EP ≥ 64, NVLink cluster | DeepEP + async_combine |
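The decision table can be encoded as a small helper (illustrative only; it ignores edge cases such as PP > 1, where async combine is unsupported):

```python
def recommend_dispatch(ep_size: int, nvlink_cluster: bool) -> str:
    """Map (EP size, cluster type) to a dispatch backend per the table above."""
    if ep_size <= 8 or not nvlink_cluster:
        return "alltoall"           # NCCL is sufficient; IB already uses RDMA
    if ep_size >= 64:
        return "deepep + async_combine"
    return "deepep"                 # EP >= 16 on an NVLink cluster
```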

deep_ep not found:

AttributeError: module 'deep_ep' has no attribute 'Buffer'

Install the DeepEP wheel matching your CUDA version from xorl-org/xorl-wheels.

SIGABRT / num_recv_tokens: -1 on all ranks:

WARN: device mlx5_0 cannot allocate buffer on the specified memory type. Skipping...
Global rank: 0, num_recv_tokens: -1, num_rdma_recv_tokens: -1

nvidia_peermem is not loaded. Run sudo modprobe nvidia_peermem on all nodes. See Cluster Prerequisites.

init failed for transport: IBGDA:

The IBGDA driver settings are not active. Check that NVreg_EnableStreamMemOPs=1 and PeerMappingOverride=1 are set in /etc/modprobe.d/nvidia.conf, then run sudo update-initramfs -u and reboot. Verify with:

sudo cat /proc/driver/nvidia/params | grep EnableStreamMemOPs

Buffer initialization OOM: Reduce deepep_buffer_size_gb. Check available GPU memory before the EP buffer allocation.

SM contention (low compute throughput): Reduce deepep_num_sms to 8–12. Use XORL_DEBUG_EP=1 to print per-phase timing:

XORL_DEBUG_EP=1 torchrun ... -m xorl.cli.train config.yaml

Shape mismatch errors during backward: Ensure moe_checkpoint_method: moe_act is set — R3 routing replay is required for EP + gradient checkpointing. See MoE Routing Replay.

No fallback to AllToAll: If DeepEP fails to initialize, xorl does not fall back automatically. Set ep_dispatch: alltoall explicitly.


| File | Description |
| --- | --- |
| `src/xorl/models/layers/moe/moe_block.py` | `MoEBlock` — DeepEP dispatch/combine integration, async combine stream management |
| `src/xorl/models/layers/moe/experts.py` | `MoEExperts._ep_forward()` — DeepEP dispatch, compute, combine phases; `XORL_DEBUG_EP` timing |