# QLoRA
For QLoRA with MoE models, see MoE LoRA.
QLoRA stores base model weights in a quantized format and adds trainable LoRA adapters on top. This reduces the GPU memory footprint of the base weights by 2–8× while keeping LoRA parameters in full precision.
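As a concrete illustration, here is a minimal numpy sketch of the QLoRA compute pattern: the frozen base weight lives in quantized form, and only the full-precision LoRA factors receive gradients. The toy absmax quantizer below is illustrative only, not one of xorl's formats:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2
W = rng.standard_normal((d_out, d_in)).astype(np.float32)

# Toy 4-bit absmax quantization of the frozen base weight (one scale per row).
scale = np.abs(W).max(axis=1, keepdims=True) / 7.0
W_q = np.clip(np.round(W / scale), -8, 7).astype(np.int8)  # stored: int4 range + scales

# Trainable LoRA factors stay in full precision. B starts at zero, so the
# adapted layer initially matches the dequantized base layer exactly.
alpha = 32
A = rng.standard_normal((r, d_in)).astype(np.float32) * 0.01
B = np.zeros((d_out, r), dtype=np.float32)

x = rng.standard_normal(d_in).astype(np.float32)
y = (W_q * scale) @ x + (alpha / r) * (B @ (A @ x))  # dequant(W) x + scaled LoRA delta

assert np.allclose(y, (W_q * scale) @ x)  # B == 0, so the LoRA delta is zero at init
```

Only `A` and `B` (plus their optimizer state) are held in full precision, which is why the base-weight footprint dominates the savings.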
## Supported Formats

| Format | Bits/weight | Group size | Best for |
|---|---|---|---|
| `nvfp4` | 4-bit (FP4 E2M1) | 16 | Hopper (H100) with ModelOpt checkpoints |
| `block_fp8` | 8-bit (FP8 E4M3) | 128×128 block | Hopper, from HF FP8 checkpoints |
| `nf4` | 4-bit (NF4) | 64 | Any GPU, quantizes from BF16 at init |
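The group sizes above determine how much overhead the per-group scales add on top of the payload bits. A quick back-of-envelope, assuming FP32 scales for NF4 and FP8 per-block scales for NVFP4 (the scale dtypes are assumptions, and NVFP4's small global scale tensor is ignored):

```python
# Effective bits per weight for group-wise quantization: payload bits plus the
# amortized cost of one scale per group. Scale dtypes here are assumptions.
def effective_bits(weight_bits: int, group_size: int, scale_bits: int) -> float:
    return weight_bits + scale_bits / group_size

nf4 = effective_bits(4, 64, 32)    # 4 + 32/64 = 4.5 bits/weight
nvfp4 = effective_bits(4, 16, 8)   # 4 + 8/16  = 4.5 bits/weight
print(nf4, nvfp4)
```

Smaller groups track local weight statistics more closely, at the cost of more scales per weight; a cheaper scale dtype (FP8 vs FP32) is what lets NVFP4 afford groups of 16.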
## Basic Configuration

```yaml
lora:
  enable_qlora: true
  quant_format: nf4      # nf4 | nvfp4 | block_fp8
  lora_rank: 16
  lora_alpha: 32
  lora_target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
```
## Format Details

### NF4

NF4 quantizes BF16 weights on the fly during model initialization. No pre-quantized checkpoint is required, and it works on any GPU.
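A minimal sketch of the group-wise scheme, using a toy symmetric codebook rather than the real NF4 levels (which sit at quantiles of a standard normal distribution, a better fit for BF16 weight statistics):

```python
import numpy as np

# Toy symmetric 4-bit codebook; real NF4 uses normal-quantile levels instead.
CODEBOOK = np.linspace(-1.0, 1.0, 16).astype(np.float32)

def quantize_groupwise(w, group_size=64):
    """Group-wise absmax codebook quantization (illustrative sketch)."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True)           # one scale per group
    normed = groups / scales
    idx = np.abs(normed[:, :, None] - CODEBOOK).argmin(axis=2)   # nearest codebook level
    return idx.astype(np.uint8), scales

def dequantize_groupwise(idx, scales):
    return CODEBOOK[idx] * scales

w = np.random.default_rng(0).standard_normal(256).astype(np.float32)
idx, scales = quantize_groupwise(w)
w_hat = dequantize_groupwise(idx, scales).reshape(w.shape)

# Per-element error is bounded by half a codebook step times the group's scale.
assert np.abs(w - w_hat).max() <= np.abs(w).max() / 15 + 1e-6
```

Only the 4-bit indices and the per-group scales are stored; dequantization happens on the fly inside the forward pass.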
```yaml
quant_format: nf4
quant_group_size: 64   # quantization block size
```

### NVFP4

NVFP4 uses NVIDIA's FP4 E2M1 format with per-block scales and a global scale tensor. It requires either a pre-quantized checkpoint (NVIDIA ModelOpt format) or quantizes from BF16 at init.
```yaml
quant_format: nvfp4
quant_group_size: 16
```

Loading from a pre-quantized ModelOpt checkpoint:
```yaml
model:
  model_path: /path/to/nvfp4_checkpoint
```

xorl auto-detects NVFP4 checkpoints and skips re-quantization.
### Block FP8

Block-wise FP8 E4M3 quantization. Compatible with HF FP8 checkpoints (e.g., Qwen3-235B-A22B-Instruct-FP8).
```yaml
quant_format: block_fp8
quant_group_size: 128
```
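A sketch of the block-wise scaling idea: one scale per 128×128 tile, chosen so the tile's absmax maps onto the FP8 E4M3 dynamic range. This is illustrative only; a real kernel would also round the scaled payload to actual FP8 values:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_block_quantize(w: np.ndarray, block: int = 128):
    """One scale per (block x block) tile so each tile fits the E4M3 range."""
    rows, cols = w.shape
    scales = np.zeros((rows // block, cols // block), dtype=np.float32)
    w_scaled = np.empty_like(w)
    for i in range(rows // block):
        for j in range(cols // block):
            tile = w[i*block:(i+1)*block, j*block:(j+1)*block]
            s = np.abs(tile).max() / E4M3_MAX
            scales[i, j] = s
            w_scaled[i*block:(i+1)*block, j*block:(j+1)*block] = tile / s
    return w_scaled, scales

w = np.random.default_rng(1).standard_normal((256, 256)).astype(np.float32)
w_scaled, scales = fp8_block_quantize(w)
assert np.abs(w_scaled).max() <= E4M3_MAX + 1e-3  # every tile fits the FP8 range
```

At 8 bits the payload keeps an exponent and mantissa, so block FP8 trades half of NF4's compression for noticeably lower quantization error.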
## Key Parameters

| Parameter | Default | Description |
|---|---|---|
| `enable_qlora` | `false` | Enable QLoRA |
| `quant_format` | `nvfp4` | Quantization format: `nvfp4`, `block_fp8`, `nf4` |
| `quant_group_size` | format-dependent | Quantization block size |
| `lora_rank` | 16 | LoRA rank *r* |
| `lora_alpha` | 16 | LoRA scaling factor |
| `lora_target_modules` | `null` | Module names to inject QLoRA into |
| `save_lora_only` | `false` | Save only LoRA weights (not the packed base) |
| `exclude_modules` | `null` | Keep these modules in BF16 (e.g. `lm_head`) |
| `merge_lora_interval` | 0 | Merge LoRA into base every N steps (0 = off) |
| `reset_optimizer_on_merge` | `false` | Reset optimizer state on merge (ReLoRA) |
| `enable_aqn` | `false` | Adaptive Quantization Noise |
| `aqn_alpha` | 1.0 | AQN noise scale |
## Periodic Merge (ReLoRA)

Periodically merge the LoRA delta back into the quantized base weights, then restart LoRA from scratch. After a merge, the base weight stores `quant(W + lora_delta)` and the correction is recomputed.
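The merge step can be sketched as follows, with a toy absmax quantizer standing in for the configured `quant_format`:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 32

def quant(w):
    s = np.abs(w).max() / 7.0
    return np.clip(np.round(w / s), -8, 7).astype(np.int8), s

def dequant(q, s):
    return q * s

W_q, s = quant(rng.standard_normal((d, d)).astype(np.float32))
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01
B = rng.standard_normal((d, r)).astype(np.float32) * 0.01

# Every merge_lora_interval steps:
delta = (alpha / r) * (B @ A)
W_q, s = quant(dequant(W_q, s) + delta)                     # base absorbs the LoRA delta
A = rng.standard_normal((r, d)).astype(np.float32) * 0.01   # restart LoRA from scratch
B = np.zeros((d, r), dtype=np.float32)                      # B = 0 again
# with reset_optimizer_on_merge, the Adam moments for A and B are zeroed too
```

Each merge lets the low-rank adapter start accumulating a fresh delta, so the total update across training can exceed rank *r*.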
```yaml
merge_lora_interval: 200         # merge every 200 steps
reset_optimizer_on_merge: true   # clear Adam momentum/variance (ReLoRA)
```
## Excluding Modules from Quantization

Some layers (e.g. `lm_head`, embeddings) should remain in BF16:
```yaml
exclude_modules: [lm_head, embed_tokens]
```
## Adaptive Quantization Noise (AQN)

AQN adds calibrated noise to the dequantized weights during training to regularize against quantization error. It is enabled per format:
```yaml
enable_aqn: true
aqn_alpha: 1.0   # noise magnitude scale
```
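The text above does not spell out the noise model, so here is one plausible reading as a sketch: perturb the dequantized weights with noise on the order of the quantization step, scaled by `aqn_alpha` (an assumption; xorl's calibration may differ):

```python
import numpy as np

# Assumed AQN scheme: uniform noise of at most half a quantization step,
# scaled by aqn_alpha, applied only during training.
rng = np.random.default_rng(0)
aqn_alpha = 1.0
group_size = 64

w_deq = rng.standard_normal((4, group_size)).astype(np.float32)
step = np.abs(w_deq).max(axis=1, keepdims=True) / 7.0   # toy absmax step per group
noise = rng.uniform(-0.5, 0.5, w_deq.shape).astype(np.float32) * step
w_train = w_deq + aqn_alpha * noise                     # inference uses w_deq unchanged

assert np.abs(w_train - w_deq).max() <= 0.5 * step.max() * aqn_alpha + 1e-6
```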
## QLoRA with MoE

For MoE models, expert weights are quantized through `QLoRAMoeExperts`:
```yaml
lora:
  enable_qlora: true
  quant_format: nvfp4
  lora_rank: 16
  # the gate_proj/up_proj/down_proj targets map to the fused expert weights
  lora_target_modules: [qkv_proj, gate_up_proj, down_proj, o_proj]
```

MoE expert loading uses `maybe_load_and_quantize_moe_qlora()`, which handles EP-sharded expert tensors.
## Memory Usage

Approximate memory reduction versus BF16 full fine-tuning (per parameter):
| Config | Memory vs BF16 |
|---|---|
| BF16 + LoRA (rank 16) | ~100% base + tiny LoRA |
| NF4 + LoRA (rank 16) | ~25% base + tiny LoRA |
| NVFP4 + LoRA (rank 16) | ~12% base + tiny LoRA |
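For scale, a back-of-envelope estimate of base-weight memory for a 7B-parameter model under BF16 versus a 4-bit format (weights only; activations, optimizer state, and scale overhead are excluded):

```python
# Base-weight memory only: 2 bytes/param for BF16, 0.5 bytes/param at 4 bits.
params = 7e9
bf16_gb = params * 2 / 1e9     # 14 GB
nf4_gb = params * 0.5 / 1e9    # 3.5 GB
print(f"BF16: {bf16_gb:.0f} GB, 4-bit: {nf4_gb:.1f} GB")
```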
## Example Configs

See `examples/local/dummy/configs/qlora/` and `examples/server/configs/qlora/` for complete QLoRA configs.
## Source

| File | Description |
|---|---|
| `src/xorl/ops/block_fp8.py` | Block FP8 quantization/dequantization ops |
| `src/xorl/ops/nf4.py` | NF4 quantization ops |
| `src/xorl/qlora/` | QLoRA module implementations |
| `src/xorl/qlora/detect_prequantized.py` | Auto-detection of pre-quantized checkpoints |