
LoRA & QLoRA on Experts

xorl supports LoRA adapters on fused expert weight tensors via MoEExpertsLoRA (src/xorl/models/layers/moe/lora.py). Base weights are frozen; low-rank adapters are trained in GKN layout, matching the sharding of the underlying expert tensors.

For each expert e and projection p ∈ {gate, up, down}:

W_eff[e] = W[e] + scale × (lora_A[e] @ lora_B[e])
where scale = lora_alpha / r

so that the low-rank update has the same [K, N] shape as the base expert weight.

Adapters follow GKN layout:

| Projection | lora_A shape | lora_B shape |
| --- | --- | --- |
| gate_proj | [E, H, r] | [E, r, I] |
| up_proj | [E, H, r] | [E, r, I] |
| down_proj | [E, I, r] | [E, r, H] |

Where E = num_experts, r = lora_rank, H = hidden_size, I = intermediate_size.
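As a quick shape check, the per-expert update can be sketched with NumPy (a standalone illustration with toy sizes, not xorl's implementation):

```python
import numpy as np

# Toy sizes (illustrative only): experts, hidden, intermediate, rank
E, H, I, r = 4, 8, 16, 2
lora_alpha = 4
scale = lora_alpha / r

W = np.zeros((E, H, I))            # fused gate_proj base weights, GKN layout [E, K, N]
lora_A = np.random.randn(E, H, r)  # [E, H, r]
lora_B = np.random.randn(E, r, I)  # [E, r, I]

# W_eff[e] = W[e] + scale * (lora_A[e] @ lora_B[e]), batched over experts
W_eff = W + scale * np.einsum("ekr,ern->ekn", lora_A, lora_B)
print(W_eff.shape)  # → (4, 8, 16)
```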


Both adapter tensors shard along dim 0 (E) identically to base weights. An EP rank holds [E/ep_size, K, r] (lora_A) and [E/ep_size, r, N] (lora_B). No extra collective is required — adapter compute is entirely local after dispatch.
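A minimal sketch of that sharding, assuming a plain contiguous split over experts (illustrative only, not xorl's dispatch code):

```python
import numpy as np

E, K, N, r, ep_size = 8, 16, 32, 4, 2
lora_A = np.random.randn(E, K, r)
lora_B = np.random.randn(E, r, N)

# Slice adapters along dim 0 exactly like the base [E, K, N] tensor,
# so each EP rank's adapter math stays local after token dispatch.
local_A = np.split(lora_A, ep_size, axis=0)
local_B = np.split(lora_B, ep_size, axis=0)
print(local_A[0].shape, local_B[0].shape)  # → (4, 16, 4) (4, 4, 32)
```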


Hybrid shared LoRA (moe_hybrid_shared_lora: true) reduces parameter count by sharing one adapter matrix across experts while keeping the other per-expert:

| Projection | lora_A | lora_B | Rationale |
| --- | --- | --- | --- |
| gate_proj, up_proj | [1, H, r] — shared | [E, r, I] — per-expert | Input space is shared; expert specialization lives in the output |
| down_proj | [E, I, r] — per-expert | [1, r, H] — shared | Expert specialization lives in the input; shared output projection |

This roughly halves the LoRA parameter count for large E while keeping per-expert expressiveness where it matters most.
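A back-of-the-envelope check of the savings, using made-up sizes with H = I so the halving is easy to see (real models will give a somewhat different ratio):

```python
# Hypothetical sizes, not taken from any real model
E, H, I, r = 64, 4096, 4096, 16

# Standard per-expert LoRA on gate_proj: A [E, H, r] plus B [E, r, I]
per_expert = E * H * r + E * r * I

# Hybrid shared: A shared [1, H, r], B per-expert [E, r, I]
hybrid = H * r + E * r * I

print(hybrid / per_expert)  # → 0.5078125, i.e. roughly half
```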

```yaml
# Server training config
moe_shared_lora: false        # no sharing
moe_hybrid_shared_lora: true  # hybrid shared (recommended for large E)
```

Standard LoRA on experts:

```yaml
lora:
  enable_lora: true
  lora_rank: 16
  lora_alpha: 32
  lora_target_modules: [qkv_proj, gate_up_proj, down_proj, o_proj]
```

Hybrid shared LoRA (server training):

```yaml
lora:
  enable_lora: true
  lora_rank: 16
  lora_alpha: 32
  lora_target_modules: [gate_up_proj, down_proj]
  moe_hybrid_shared_lora: true
```

Expert LoRA adapters are checkpointed with the same EP-sharded layout. Use save_lora_only: true to checkpoint only adapter weights.
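The effect of save_lora_only can be sketched as a state-dict filter; the "lora_" naming below is an assumption for illustration, not necessarily xorl's actual parameter names:

```python
def lora_only_state_dict(state_dict):
    # Keep only adapter tensors; base expert weights are frozen and
    # reloadable from the original checkpoint. ("lora_" prefix is assumed.)
    return {k: v for k, v in state_dict.items() if "lora_" in k}

full = {"experts.weight": 0, "experts.lora_A": 1, "experts.lora_B": 2}
print(sorted(lora_only_state_dict(full)))  # → ['experts.lora_A', 'experts.lora_B']
```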


Expert weights support all of xorl's quantization formats via QLoRAMoeExperts. The fused [E, K, N] tensor is quantized as a unit:

```yaml
lora:
  enable_qlora: true
  quant_format: nvfp4  # or block_fp8, nf4
  lora_rank: 16
  lora_target_modules: [gate_up_proj, down_proj]
```

Expert loading uses maybe_load_and_quantize_moe_qlora(), which handles EP-sharded expert tensors: each rank quantizes only its local [E/ep_size, K, N] shard, so per-rank QLoRA memory scales with the local expert count E/ep_size rather than the full E.
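The idea can be illustrated with a stand-in absmax int8 quantizer (a toy, not one of the real formats listed below):

```python
import numpy as np

def quantize_absmax(x):
    # Stand-in quantizer: symmetric int8 with one scale per expert.
    scale = np.abs(x).max(axis=(1, 2), keepdims=True) / 127.0
    return np.round(x / scale).astype(np.int8), scale

E, K, N, ep_size, rank = 8, 16, 32, 2, 0
experts = np.random.randn(E, K, N).astype(np.float32)

# Each EP rank quantizes only its local [E/ep_size, K, N] shard,
# so quantization work and memory scale with E/ep_size, not E.
local = np.split(experts, ep_size, axis=0)[rank]
q, scale = quantize_absmax(local)
print(q.shape, q.dtype)  # → (4, 16, 32) int8
```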

Supported quant formats on expert tensors:

| Format | Bits | Notes |
| --- | --- | --- |
| nf4 | 4-bit | NormalFloat4; bitsandbytes-compatible |
| block_fp8 | 8-bit | Block-wise FP8; good throughput on H100 |
| nvfp4 | 4-bit | NVIDIA FP4 (Hopper); requires Transformer Engine |

| File | Description |
| --- | --- |
| src/xorl/models/layers/moe/lora.py | MoEExpertsLoRA — per-expert LoRA with hybrid shared option |
| src/xorl/lora/modules/base.py | Base LoRA module; adapter init and merge logic |
| src/xorl/qlora/ | QLoRAMoeExperts — quantized expert weights with LoRA adapters |