# DeepEP
DeepEP is an NVLink-optimized expert parallelism dispatch backend. It replaces NCCL AllToAll with direct GPU-to-GPU NVLink transfers, significantly reducing dispatch and combine latency for MoE models on NVLink clusters.
## Requirements

- NVLink-connected GPU cluster (H100 NVLink or DGX H100)
- `deep_ep` wheel installed (see Installation)
- `ep_dispatch: deepep` in model config
- For internode (multi-node) EP: `nvidia_peermem` kernel module loaded and IBGDA enabled on all nodes (see below)
## Installation

```sh
pip install deep_ep-*.whl  # from xorl-org/xorl-wheels releases
```

Verify:

```python
import deep_ep
print("DeepEP available")
```

## Cluster Prerequisites (multi-node)

For single-node EP (all GPUs on one machine), the wheel alone is sufficient. For multi-node EP, NVSHMEM handles inter-node RDMA communication and requires two additional steps on every node.
### 1. Load nvidia_peermem

`nvidia_peermem` bridges the NVIDIA driver and the InfiniBand stack to enable GPUDirect RDMA. Without it, NVSHMEM cannot register GPU buffers with IB HCAs, and DeepEP will crash at the first dispatch with SIGABRT and errors like:

```
WARN: device mlx5_0 cannot allocate buffer on the specified memory type. Skipping...
```

Load it on every node:

```sh
sudo modprobe nvidia_peermem
```

Verify:

```sh
lsmod | grep nvidia_peermem
```

To persist across reboots, add it to `/etc/modules`:
```sh
echo nvidia_peermem | sudo tee -a /etc/modules
```

### 2. Enable IBGDA in the NVIDIA driver

IBGDA allows NVSHMEM to issue RDMA operations directly from GPU SM threads without CPU involvement. Add the following to `/etc/modprobe.d/nvidia.conf` on every node:

```
options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"
```

Then rebuild the initramfs and reboot:

```sh
sudo update-initramfs -u
sudo reboot
```

Verify after reboot:

```sh
sudo cat /proc/driver/nvidia/params | grep -E "EnableStreamMemOPs|RegistryDwords"
# EnableStreamMemOPs: 1
# RegistryDwords: "PeerMappingOverride=1;"
```

Important: `nvidia_peermem` must be reloaded after every reboot (via `/etc/modules` or `modprobe`) — the IBGDA driver settings alone do not load it automatically.
## Configuration

```yaml
model:
  ep_dispatch: deepep
  deepep_buffer_size_gb: 2.0   # NVLink buffer pool size per GPU (default: 2.0)
  deepep_num_sms: 20           # SMs dedicated to communication (default: 20)
  deepep_async_combine: false  # async combine overlap (experimental)
```

| Parameter | Default | Description |
|---|---|---|
| `ep_dispatch` | `alltoall` | Set to `deepep` to enable |
| `deepep_buffer_size_gb` | 2.0 | Per-GPU NVLink buffer pool in GB. Larger = fewer chunked transfers. Rule of thumb: 2 × token_budget × hidden_dim × sizeof(bf16) |
| `deepep_num_sms` | 20 | SMs dedicated to communication kernels. Must be even. |
| `deepep_async_combine` | false | Overlap combine with next layer’s compute (experimental) |
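The constraints above (an even `deepep_num_sms`, a positive buffer size) can be checked before launch. The following is an illustrative sketch, not part of the xorl API; the function name `validate_deepep_config` is hypothetical:

```python
def validate_deepep_config(cfg: dict) -> list:
    """Sanity-check DeepEP options; returns a list of human-readable problems.

    Hypothetical helper mirroring the constraints documented above:
    ep_dispatch must be 'deepep', deepep_num_sms must be even, and
    deepep_buffer_size_gb must be positive.
    """
    problems = []
    if cfg.get("ep_dispatch") != "deepep":
        problems.append("ep_dispatch must be 'deepep' to enable DeepEP")
    sms = cfg.get("deepep_num_sms", 20)  # default from the table above
    if sms % 2 != 0:
        problems.append("deepep_num_sms must be even, got %d" % sms)
    if cfg.get("deepep_buffer_size_gb", 2.0) <= 0:
        problems.append("deepep_buffer_size_gb must be positive")
    return problems

print(validate_deepep_config({"ep_dispatch": "deepep", "deepep_num_sms": 21}))
```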
## How DeepEP Works

Standard AllToAll routes tokens through the CPU-managed NCCL pipeline:

```
GPU 0 → NCCL buffer → NVLink → NCCL buffer → GPU 1
```

DeepEP uses RDMA-style direct GPU memory access:

```
GPU 0 → NVLink (direct write to GPU 1 memory)
```

This eliminates staging through NCCL buffers and CPU synchronization points, reducing dispatch latency from ~5 ms to ~1 ms per step on 64-GPU clusters.
## SM Allocation Strategy

DeepEP dedicates a fixed number of SMs to communication kernels. This creates a direct tradeoff between communication bandwidth and compute throughput:

| `deepep_num_sms` | Communication bandwidth | Compute SMs remaining | Best for |
|---|---|---|---|
| 8 | Lower (fewer dispatch threads) | More (H100 has 132 SMs) | Expert FFN is the bottleneck |
| 20 (default) | Balanced | Balanced | General purpose |
| 32 | Higher | Fewer | Large token budgets or high EP sizes |
| 48+ | Maximum | Fewest | Very large models, EP ≥ 64 |

`deepep_num_sms` must be even. Start with 20 and tune based on profiling — if `XORL_DEBUG_EP=1` shows dispatch time > compute time, increase SMs; if compute time dominates, decrease.
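That tuning rule can be sketched as a small helper. This is purely illustrative (the function, the step size of 4, and the 8–48 clamp are assumptions, not xorl behavior); inputs would come from `XORL_DEBUG_EP=1` phase timings:

```python
def suggest_num_sms(dispatch_ms: float, compute_ms: float,
                    current: int = 20, step: int = 4) -> int:
    """Suggest a deepep_num_sms value from per-phase timings.

    Hypothetical tuning rule mirroring the guidance above:
    communication-bound -> give comms more SMs; clearly
    compute-bound -> reclaim SMs. Keeps the result even,
    as DeepEP requires, and clamps to the 8-48 range.
    """
    if dispatch_ms > compute_ms:
        suggested = current + step       # dispatch dominates: add comm SMs
    elif compute_ms > 2 * dispatch_ms:
        suggested = current - step       # compute dominates: reclaim SMs
    else:
        suggested = current              # roughly balanced: keep as-is
    suggested = max(8, min(48, suggested))
    return suggested if suggested % 2 == 0 else suggested + 1

print(suggest_num_sms(dispatch_ms=3.0, compute_ms=1.5))  # communication-bound: 24
```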
## Buffer Size Tuning

The NVLink buffer pool is pre-allocated per GPU at startup. If the buffer is too small for a step’s token volume, DeepEP sends in multiple chunks (increasing latency):

```
optimal_buffer_gb = (tokens_per_rank × hidden_dim × ep_size × 2 bytes) / (1024³)
```

For Qwen3-235B-A22B (hidden=7168, top_k=8, seq_len=4096, ep_size=64):

```
tokens_per_rank ≈ 4096 × 8 / 64 = 512 tokens
buffer = 512 × 7168 × 64 × 2 / (1024³) ≈ 0.44 GB
→ set deepep_buffer_size_gb: 1.0 (2× headroom)
```

If you see OOM during initialization, reduce `deepep_buffer_size_gb`. If profiling shows multiple chunked transfers per step, increase it.
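The worked example above can be reproduced with a few lines of Python (an illustrative helper, not part of the xorl API; `dtype_bytes=2` assumes bf16 as in the formula):

```python
def optimal_buffer_gb(seq_len: int, top_k: int, hidden_dim: int,
                      ep_size: int, dtype_bytes: int = 2) -> float:
    """Estimate the DeepEP buffer size in GB using the rule above:
    tokens_per_rank * hidden_dim * ep_size * dtype_bytes / 1024^3."""
    tokens_per_rank = seq_len * top_k // ep_size
    return tokens_per_rank * hidden_dim * ep_size * dtype_bytes / 1024**3

# Qwen3-235B-A22B example from the text:
est = optimal_buffer_gb(seq_len=4096, top_k=8, hidden_dim=7168, ep_size=64)
print(round(est, 2))  # 0.44 -> configure deepep_buffer_size_gb: 1.0 for ~2x headroom
```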
## Async Combine (Experimental)

`deepep_async_combine: true` overlaps the combine communication (outputs flowing back from expert ranks) with the next layer’s compute:

```
Step N:   dispatch → compute → [combine starts]
Step N+1: [combine finishes] + next layer compute (overlapped)
```

Benefit: hides combine latency behind useful compute, especially when combine > dispatch (typical for large output projections).
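A toy timing model makes the benefit concrete. This is a simplification I am introducing for illustration (it ignores stream scheduling overhead and assumes combine can only hide behind the next layer’s compute):

```python
def moe_layers_time_ms(n_layers: int, dispatch: float, compute: float,
                       combine: float, async_combine: bool) -> float:
    """Toy model of total MoE layer time.

    Without async combine, each layer pays dispatch + compute + combine
    serially. With it, a layer's combine runs concurrently with the next
    layer's compute, so only the portion of combine that exceeds compute
    is paid; the final layer's combine has nothing to hide behind.
    """
    if not async_combine:
        return n_layers * (dispatch + compute + combine)
    total = 0.0
    for layer in range(n_layers):
        total += dispatch + compute
        if layer == n_layers - 1:
            total += combine                          # last combine is exposed
        else:
            total += max(0.0, combine - compute)      # only unhidden remainder
    return total

serial = moe_layers_time_ms(4, dispatch=1.0, compute=3.0, combine=2.0, async_combine=False)
overlapped = moe_layers_time_ms(4, dispatch=1.0, compute=3.0, combine=2.0, async_combine=True)
print(serial, overlapped)  # 24.0 18.0: three of the four combines are fully hidden
```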
Limitations:
- Experimental — correctness verified on Qwen3 but not all architectures
- Requires careful ordering of CUDA streams
- Not compatible with pipeline parallelism (PP > 1)
## Performance Benchmark

On Qwen3-235B-A22B with EP=64 (8 nodes, 64 H100 NVLink GPUs):
| Config | Dispatch + combine time | Throughput |
|---|---|---|
| AllToAll (NCCL) | ~5s per forward_backward | Baseline |
| DeepEP (num_sms=20) | ~1s per forward_backward | ~5× faster |
| DeepEP + async_combine | ~0.6s per forward_backward | ~8× faster |
Gains are most pronounced at high EP sizes (≥ 32) where AllToAll’s O(EP) staging bottleneck dominates.
## When to Use DeepEP vs AllToAll

| Scenario | Recommendation |
|---|---|
| EP ≤ 8, single node | AllToAll — same physical GPUs, NVLink available via NCCL |
| EP ≤ 8, multi-node InfiniBand | AllToAll — IB already uses RDMA |
| EP ≥ 16, NVLink cluster | DeepEP |
| EP ≥ 32, NVLink cluster | DeepEP strongly recommended |
| EP ≥ 64, NVLink cluster | DeepEP + async_combine |
## Troubleshooting

**`deep_ep` not found:**

```
AttributeError: module 'deep_ep' has no attribute 'Buffer'
```

Install the DeepEP wheel matching your CUDA version from xorl-org/xorl-wheels.

**SIGABRT / `num_recv_tokens: -1` on all ranks:**

```
WARN: device mlx5_0 cannot allocate buffer on the specified memory type. Skipping...
Global rank: 0, num_recv_tokens: -1, num_rdma_recv_tokens: -1
```

`nvidia_peermem` is not loaded. Run `sudo modprobe nvidia_peermem` on all nodes. See Cluster Prerequisites.

**`init failed for transport: IBGDA`:**

IBGDA driver settings are not active. Check that `NVreg_EnableStreamMemOPs=1` and `PeerMappingOverride=1` are set in `/etc/modprobe.d/nvidia.conf`, then run `sudo update-initramfs -u` and reboot. Verify with:

```sh
sudo cat /proc/driver/nvidia/params | grep EnableStreamMemOPs
```

**Buffer initialization OOM:**

Reduce `deepep_buffer_size_gb`. Check available GPU memory before the EP buffer allocation.

**SM contention (low compute throughput):**

Reduce `deepep_num_sms` to 8–12. Use `XORL_DEBUG_EP=1` to print per-phase timing:

```sh
XORL_DEBUG_EP=1 torchrun ... -m xorl.cli.train config.yaml
```

**Shape mismatch errors during backward:**

Ensure `moe_checkpoint_method: moe_act` is set — R3 routing replay is required for EP + gradient checkpointing. See MoE Routing Replay.

**No fallback to AllToAll:**

If DeepEP fails to initialize, xorl does not fall back automatically. Set `ep_dispatch: alltoall` explicitly.
## Source

| File | Description |
|---|---|
| `src/xorl/models/layers/moe/moe_block.py` | `MoEBlock` — DeepEP dispatch/combine integration, async combine stream management |
| `src/xorl/models/layers/moe/experts.py` | `MoEExperts._ep_forward()` — DeepEP dispatch, compute, combine phases; `XORL_DEBUG_EP` timing |