
DeepEP

DeepEP is an NVLink-optimized expert parallelism dispatch backend. It replaces NCCL AllToAll with direct GPU-to-GPU NVLink transfers, significantly reducing dispatch and combine latency for MoE models on NVLink clusters.

Requirements:

  • NVLink-connected GPU cluster (e.g. DGX H100 or other NVLink-connected H100 systems)
  • deep_ep wheel installed (see Installation)
  • ep_dispatch: deepep in the model config
  • For internode (multi-node) EP: the nvidia_peermem kernel module loaded and IBGDA enabled on all nodes (see below)
pip install deep_ep-*.whl # from xorl-org/xorl-wheels releases

Verify:

import deep_ep
print("DeepEP available")

For single-node EP (all GPUs on one machine), the wheel alone is sufficient. For multi-node EP, NVSHMEM handles inter-node RDMA communication and requires two additional steps on every node.

nvidia_peermem bridges the NVIDIA driver and the InfiniBand stack to enable GPUDirect RDMA. Without it, NVSHMEM cannot register GPU buffers with the IB HCAs, and DeepEP will crash at the first dispatch with SIGABRT and errors like:

WARN: device mlx5_0 cannot allocate buffer on the specified memory type. Skipping...

Load it on every node:

sudo modprobe nvidia_peermem

Verify:

lsmod | grep nvidia_peermem

To persist across reboots, add it to /etc/modules:

echo nvidia_peermem | sudo tee -a /etc/modules

IBGDA allows NVSHMEM to issue RDMA operations directly from GPU SM threads without CPU involvement. Add the following to /etc/modprobe.d/nvidia.conf on every node:

options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"

Then rebuild the initramfs and reboot:

sudo update-initramfs -u
sudo reboot

Verify after reboot:

sudo cat /proc/driver/nvidia/params | grep -E "EnableStreamMemOPs|RegistryDwords"
# EnableStreamMemOPs: 1
# RegistryDwords: "PeerMappingOverride=1;"

Important: nvidia_peermem must be reloaded after every reboot (via /etc/modules or modprobe) — the IBGDA driver settings alone do not load it automatically.
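Both verification checks above can be scripted. A minimal sketch, assuming the params file keeps its usual `Key: value` layout (the `check_ibgda_params` helper is illustrative, not part of xorl):

```python
def check_ibgda_params(params_text: str) -> bool:
    """Return True if the IBGDA-related driver settings are active.

    Expects the text of /proc/driver/nvidia/params, e.g. read via:
        params_text = open("/proc/driver/nvidia/params").read()
    """
    settings = {}
    for line in params_text.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            settings[key.strip()] = value.strip()
    # Both settings from /etc/modprobe.d/nvidia.conf must be present.
    stream_mem_ops = settings.get("EnableStreamMemOPs") == "1"
    peer_mapping = "PeerMappingOverride=1" in settings.get("RegistryDwords", "")
    return stream_mem_ops and peer_mapping
```

A `False` result on a node means the modprobe options did not take effect; rebuild the initramfs and reboot.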

model:
  ep_dispatch: deepep
  deepep_buffer_size_gb: 2.0   # NVLink buffer pool size per GPU (default: 2.0)
  deepep_num_sms: 20           # SMs dedicated to communication (default: 20)
  deepep_async_combine: false  # async combine overlap (experimental)
| Parameter | Default | Description |
| --- | --- | --- |
| `ep_dispatch` | `alltoall` | Set to `deepep` to enable |
| `deepep_buffer_size_gb` | `2.0` | Per-GPU NVLink buffer pool in GB. Larger = fewer chunked transfers. Rule of thumb: 2 × token_budget × hidden_dim × sizeof(bf16) |
| `deepep_num_sms` | `20` | SMs dedicated to communication kernels. Must be even. |
| `deepep_async_combine` | `false` | Overlap combine with the next layer's compute (experimental) |
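The constraints above can be checked before launch. A minimal sketch (the `validate_deepep_config` helper is hypothetical, not xorl API; it only encodes the documented rules):

```python
def validate_deepep_config(cfg: dict) -> list:
    """Collect violations of the documented DeepEP config constraints.

    `cfg` mirrors the `model:` section of the YAML config.
    """
    errors = []
    if cfg.get("ep_dispatch") not in ("alltoall", "deepep"):
        errors.append("ep_dispatch must be 'alltoall' or 'deepep'")
    if cfg.get("deepep_num_sms", 20) % 2 != 0:
        errors.append("deepep_num_sms must be even")
    if cfg.get("deepep_buffer_size_gb", 2.0) <= 0:
        errors.append("deepep_buffer_size_gb must be positive")
    return errors
```

Running it against a config dict before `torchrun` starts saves a failed multi-node launch.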

Standard AllToAll routes tokens through the CPU-managed NCCL pipeline:

GPU 0 → NCCL buffer → NVLink → NCCL buffer → GPU 1

DeepEP uses RDMA-style direct GPU memory access:

GPU 0 → NVLink (direct write to GPU 1 memory)

This eliminates staging through NCCL buffers and CPU synchronization points, reducing dispatch latency from ~5ms to ~1ms per step on 64-GPU clusters.

[Diagram: AllToAll vs. DeepEP data path. Standard AllToAll (NCCL): GPU memory → NCCL staging buffer → NVLink → NCCL recv buffer → GPU memory; 4 copies and 2 CPU sync points per transfer, blocking dispatch, ~5ms/step at EP=64 (Qwen3-235B). DeepEP (NVLink direct): GPU memory → NVLink → GPU memory via GPU-initiated RDMA with no CPU involvement; 1 copy, no CPU sync points, SM kernels manage pipelining and flow control, ~1ms/step (5× faster at EP=64).]

DeepEP dedicates a fixed number of SMs to communication kernels. This creates a direct tradeoff between communication bandwidth and compute throughput:

| `deepep_num_sms` | Communication bandwidth | Compute SMs remaining | Best for |
| --- | --- | --- | --- |
| 8 | Lower (fewer dispatch threads) | More (an H100 has 132 SMs) | Expert FFN is the bottleneck |
| 20 (default) | Balanced | Balanced | General purpose |
| 32 | Higher | Fewer | Large token budgets or high EP sizes |
| 48+ | Maximum | Fewer | Very large models, EP ≥ 64 |

deepep_num_sms must be even. Start with 20 and tune based on profiling: if XORL_DEBUG_EP=1 shows dispatch time exceeding compute time, increase SMs; if compute time dominates, decrease them.
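That tuning rule can be sketched as a simple heuristic. The helper below is illustrative (the step size and clamp bounds are assumptions, not xorl defaults):

```python
def tune_num_sms(current: int, dispatch_ms: float, compute_ms: float,
                 step: int = 4, lo: int = 8, hi: int = 48) -> int:
    """More SMs when dispatch dominates, fewer when compute dominates.

    Keeps the result even (a DeepEP requirement) and clamped to [lo, hi].
    """
    if dispatch_ms > compute_ms:
        proposed = current + step   # communication-bound: give SMs to dispatch
    elif compute_ms > dispatch_ms:
        proposed = current - step   # compute-bound: return SMs to the experts
    else:
        proposed = current
    proposed = max(lo, min(hi, proposed))
    return proposed - (proposed % 2)  # deepep_num_sms must be even
```

Feed it the per-phase timings printed by XORL_DEBUG_EP=1 and re-profile after each change.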


The NVLink buffer pool is pre-allocated per GPU at startup. If the buffer is too small for a step’s token volume, DeepEP sends in multiple chunks (increasing latency):

optimal_buffer_gb = (tokens_per_rank × hidden_dim × ep_size × 2 bytes) / (1024³)

For Qwen3-235B-A22B (hidden=7168, top_k=8, seq_len=4096, ep_size=64):

tokens_per_rank ≈ 4096 × 8 / 64 = 512 tokens
buffer = 512 × 7168 × 64 × 2 / (1024³) ≈ 0.44 GB → set deepep_buffer_size_gb: 1.0 (2× headroom)

If you see OOM during initialization, reduce deepep_buffer_size_gb. If profiling shows multiple chunked transfers per step, increase it.
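The formula and the worked example above can be checked in a few lines. A sketch (the `deepep_buffer_gb` helper is illustrative, and rounding the recommendation up to 0.5 GB increments is an assumption):

```python
import math

def deepep_buffer_gb(seq_len: int, top_k: int, ep_size: int,
                     hidden_dim: int, dtype_bytes: int = 2,
                     headroom: float = 2.0):
    """Return (optimal_gb, recommended_gb) per the formula above.

    recommended_gb applies the headroom factor and rounds up to 0.5 GB.
    """
    tokens_per_rank = seq_len * top_k / ep_size
    optimal = tokens_per_rank * hidden_dim * ep_size * dtype_bytes / (1024 ** 3)
    recommended = math.ceil(optimal * headroom * 2) / 2
    return optimal, recommended
```

For the Qwen3-235B-A22B example this yields optimal ≈ 0.44 GB and a recommended deepep_buffer_size_gb of 1.0, matching the calculation above.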


deepep_async_combine: true overlaps the combine communication (outputs flowing back from expert ranks) with the next layer’s compute:

Step N: dispatch → compute → [combine starts]
Step N+1: [combine finishes] + next layer compute (overlapped)

Benefit: Hides combine latency behind useful compute, especially when combine > dispatch (typical for large output projections).

Limitations:

  • Experimental — correctness verified on Qwen3 but not all architectures
  • Requires careful ordering of CUDA streams
  • Not compatible with pipeline parallelism (PP > 1)

On Qwen3-235B-A22B with EP=64 (8 nodes, 64 H100 NVLink GPUs):

| Config | Dispatch + combine time | Throughput |
| --- | --- | --- |
| AllToAll (NCCL) | ~5s per forward_backward | Baseline |
| DeepEP (num_sms=20) | ~1s per forward_backward | ~5× faster |
| DeepEP + async_combine | ~0.6s per forward_backward | ~8× faster |

Gains are most pronounced at high EP sizes (≥ 32) where AllToAll’s O(EP) staging bottleneck dominates.

| Scenario | Recommendation |
| --- | --- |
| EP ≤ 8, single node | AllToAll — same physical GPUs, NVLink available via NCCL |
| EP ≤ 8, multi-node InfiniBand | AllToAll — IB already uses RDMA |
| EP ≥ 16, NVLink cluster | DeepEP |
| EP ≥ 32, NVLink cluster | DeepEP strongly recommended |
| EP ≥ 64, NVLink cluster | DeepEP + async_combine |
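The decision table can be encoded as a small helper (illustrative only; it ignores edge cases such as PP > 1, where async combine is unsupported):

```python
def recommend_dispatch(ep_size: int, nvlink_cluster: bool) -> str:
    """Map (EP size, cluster type) to a dispatch backend per the table above."""
    if ep_size <= 8 or not nvlink_cluster:
        return "alltoall"           # NCCL is sufficient; IB already uses RDMA
    if ep_size >= 64:
        return "deepep + async_combine"
    return "deepep"                 # EP >= 16 on an NVLink cluster
```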

deep_ep not found:

AttributeError: module 'deep_ep' has no attribute 'Buffer'

Install the DeepEP wheel matching your CUDA version from xorl-org/xorl-wheels.

SIGABRT / num_recv_tokens: -1 on all ranks:

WARN: device mlx5_0 cannot allocate buffer on the specified memory type. Skipping...
Global rank: 0, num_recv_tokens: -1, num_rdma_recv_tokens: -1

nvidia_peermem is not loaded. Run sudo modprobe nvidia_peermem on all nodes. See Cluster Prerequisites.

init failed for transport: IBGDA:

The IBGDA driver settings are not active. Check that NVreg_EnableStreamMemOPs=1 and PeerMappingOverride=1 are set in /etc/modprobe.d/nvidia.conf, then run sudo update-initramfs -u and reboot. Verify with:

sudo cat /proc/driver/nvidia/params | grep EnableStreamMemOPs

Buffer initialization OOM: Reduce deepep_buffer_size_gb. Check available GPU memory before the EP buffer allocation.

SM contention (low compute throughput): Reduce deepep_num_sms to 8–12. Use XORL_DEBUG_EP=1 to print per-phase timing:

XORL_DEBUG_EP=1 torchrun ... -m xorl.cli.train config.yaml

Shape mismatch errors during backward: Ensure moe_checkpoint_method: moe_act is set — R3 routing replay is required for EP + gradient checkpointing. See MoE Routing Replay.

No fallback to AllToAll: If DeepEP fails to initialize, xorl does not fall back automatically. Set ep_dispatch: alltoall explicitly.


| File | Description |
| --- | --- |
| `src/xorl/models/layers/moe/moe_block.py` | `MoEBlock` — DeepEP dispatch/combine integration, async combine stream management |
| `src/xorl/models/layers/moe/experts.py` | `MoEExperts._ep_forward()` — DeepEP dispatch, compute, combine phases; `XORL_DEBUG_EP` timing |