
QLoRA

For QLoRA with MoE models, see MoE LoRA.

QLoRA stores base model weights in a quantized format and adds trainable LoRA adapters on top. This reduces the GPU memory footprint of the base weights by 2–8× while keeping LoRA parameters in full precision.

| Format | Bits/weight | Group size | Best for |
| --- | --- | --- | --- |
| `nvfp4` | 4-bit (FP4 E2M1) | 16 | Hopper (H100) with ModelOpt checkpoints |
| `block_fp8` | 8-bit (FP8 E4M3) | 128×128 block | Hopper, from HF FP8 checkpoints |
| `nf4` | 4-bit (NF4) | 64 | Any GPU, quantizes from BF16 at init |
[Diagram] The BF16 weight W is quantized into a packed weight (NF4 / NVFP4 / FP8) with a per-group scale tensor; both are frozen. Trainable BF16 LoRA adapters A [K, r] and B [r, M], plus an optional low-rank correction (U, B), sit alongside. Forward: dequant(W) + correction + LoRA_B @ LoRA_A × scale.
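The forward pass above can be sketched in NumPy. This is a minimal illustration, not xorl's implementation: shapes are hypothetical, the correction term is omitted, and the standard `lora_alpha / lora_rank` scaling is assumed for `scale`:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, r = 64, 32, 4            # in-features, out-features, LoRA rank (hypothetical)
alpha = 8.0                    # lora_alpha

W_dequant = rng.standard_normal((M, K))   # dequantized frozen base weight
A = rng.standard_normal((r, K)) * 0.01    # LoRA A, trainable BF16
B = np.zeros((M, r))                      # LoRA B, conventionally zero-initialized
x = rng.standard_normal(K)

# Forward: y = dequant(W) @ x + scale * LoRA_B @ (LoRA_A @ x)
y = W_dequant @ x + (alpha / r) * (B @ (A @ x))
```

With B initialized to zero, the adapter contributes nothing at step 0, so the quantized model's initial outputs match the base model's.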
lora:
  enable_qlora: true
  quant_format: nf4 # nf4 | nvfp4 | block_fp8
  lora_rank: 16
  lora_alpha: 32
  lora_target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj

NF4 quantizes BF16 weights on-the-fly during model initialization. No pre-quantized checkpoint required. Works on any GPU.

quant_format: nf4
quant_group_size: 64 # quantization block size
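NF4 rounds each weight to the nearest of 16 fixed code values (quantiles of a standard normal, from the QLoRA paper; approximated to four decimals here), after per-group absmax scaling. A NumPy sketch of the idea, not xorl's kernel:

```python
import numpy as np

# Approximate NF4 code values (normal quantiles, QLoRA paper)
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(w, group_size=64):
    """Per-group absmax scale, then snap each weight to the nearest NF4 level."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True)
    codes = np.argmin(
        np.abs(groups[:, :, None] / scales[:, :, None] - NF4_LEVELS), axis=-1
    )
    return codes, scales          # codes fit in 4 bits; one scale per group

def nf4_dequantize(codes, scales, shape):
    return (NF4_LEVELS[codes] * scales).reshape(shape)

w = np.random.default_rng(0).standard_normal(256)
codes, scales = nf4_quantize(w)
w_hat = nf4_dequantize(codes, scales, w.shape)   # w recovered to within one NF4 step
```

Because the levels are normal quantiles, the 16 codes are used roughly evenly for Gaussian-distributed weights, which is what makes NF4 information-theoretically efficient for neural network weights.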

NVFP4 uses NVIDIA’s FP4 E2M1 format with per-block scales and a global scale tensor. Requires either a pre-quantized checkpoint (NVIDIA ModelOpt format) or quantizes from BF16 at init.

quant_format: nvfp4
quant_group_size: 16
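FP4 E2M1 can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a sign bit. A simplified NumPy sketch of per-group quantization to that grid; real NVFP4 stores FP8 per-block scales plus a global FP32 scale, which this sketch collapses into plain float scales:

```python
import numpy as np

FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])          # signed grid

def nvfp4_quantize(w, group_size=16):
    """Scale each group so its absmax maps to 6.0, then snap to the FP4 grid."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 6.0
    codes = np.argmin(
        np.abs(groups[:, :, None] / scales[:, :, None] - FP4_GRID), axis=-1
    )
    return codes, scales

def nvfp4_dequantize(codes, scales, shape):
    return (FP4_GRID[codes] * scales).reshape(shape)
```

The small group size (16 vs 64 for NF4) compensates for FP4's coarse grid: each group gets a scale tuned to just 16 weights, keeping outliers from blowing up the quantization error of their neighbors.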

Loading from a pre-quantized ModelOpt checkpoint:

model:
  model_path: /path/to/nvfp4_checkpoint

xorl auto-detects NVFP4 checkpoints and skips re-quantization.

block_fp8 applies block-wise FP8 E4M3 quantization. It is compatible with HF FP8 checkpoints (e.g., Qwen3-235B-A22B-Instruct-FP8).

quant_format: block_fp8
quant_group_size: 128
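Block-wise FP8 scales each 128×128 tile independently so it fits E4M3's range (max 448), then rounds to 3 mantissa bits. A NumPy simulation of the numerics (not xorl's `block_fp8` ops, which pack real FP8 tensors):

```python
import numpy as np

def round_e4m3(x):
    """Round to the nearest FP8 E4M3 value, saturating at +-448 (a sketch)."""
    sign = np.sign(x)
    a = np.minimum(np.abs(x), 448.0)
    out = np.zeros_like(a)
    nz = a > 0
    e = np.clip(np.floor(np.log2(a[nz])), -6, 8)   # exponent, incl. subnormal floor
    step = 2.0 ** (e - 3)                          # 3 mantissa bits
    out[nz] = np.minimum(np.round(a[nz] / step) * step, 448.0)
    return sign * out

def quantize_block_fp8(w, block=128):
    """One scale per (block x block) tile so each tile's absmax maps to 448."""
    R, C = w.shape
    wq = np.zeros_like(w)
    scales = np.zeros((R // block, C // block))
    for i in range(0, R, block):
        for j in range(0, C, block):
            blk = w[i:i + block, j:j + block]
            s = np.abs(blk).max() / 448.0
            scales[i // block, j // block] = s
            wq[i:i + block, j:j + block] = round_e4m3(blk / s)
    return wq, scales

def dequantize_block_fp8(wq, scales, block=128):
    return wq * np.kron(scales, np.ones((block, block)))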
| Parameter | Default | Description |
| --- | --- | --- |
| `enable_qlora` | false | Enable QLoRA |
| `quant_format` | nvfp4 | Quantization format: `nvfp4`, `block_fp8`, `nf4` |
| `quant_group_size` | format-dependent | Quantization block size |
| `lora_rank` | 16 | LoRA rank r |
| `lora_alpha` | 16 | LoRA scaling factor |
| `lora_target_modules` | null | Module names to inject QLoRA into |
| `save_lora_only` | false | Save only LoRA weights (not the packed base) |
| `exclude_modules` | null | Keep these modules in BF16 (e.g. lm_head) |
| `merge_lora_interval` | 0 | Merge LoRA into base every N steps (0 = off) |
| `reset_optimizer_on_merge` | false | Reset optimizer states on merge (ReLoRA) |
| `enable_aqn` | false | Adaptive Quantization Noise |
| `aqn_alpha` | 1.0 | AQN noise scale |

Periodically merge the LoRA delta back into the quantized base weights, then restart LoRA from scratch. After merging, the base weight stores quant(W + lora_delta) and the correction is recomputed.

merge_lora_interval: 200 # merge every 200 steps
reset_optimizer_on_merge: true # clear Adam momentum/variance (ReLoRA)
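The merge-and-restart cycle can be sketched as follows. The quantizer here is a toy symmetric absmax round (standing in for NF4/NVFP4), and the names are illustrative, not xorl internals:

```python
import numpy as np

def quant(w, levels=15):
    """Toy symmetric absmax quantizer standing in for NF4/NVFP4 (a sketch)."""
    s = np.abs(w).max() / (levels // 2)
    return np.round(w / s), s

def dequant(codes, s):
    return codes * s

rng = np.random.default_rng(0)
M, K, r, alpha = 8, 8, 2, 4.0
W_codes, W_scale = quant(rng.standard_normal((M, K)))
A = rng.standard_normal((r, K)) * 0.1   # pretend these were trained for N steps
B = rng.standard_normal((M, r)) * 0.1
adam_state = {}                         # stand-in for optimizer moments

# Every merge_lora_interval steps: fold the LoRA delta into the quantized base...
delta = (alpha / r) * (B @ A)
W_codes, W_scale = quant(dequant(W_codes, W_scale) + delta)

# ...then restart LoRA from scratch
A = rng.standard_normal((r, K)) * 0.01
B = np.zeros((M, r))
adam_state = {}                         # reset_optimizer_on_merge (ReLoRA)
```

Repeated merges let a rank-r adapter accumulate an effectively higher-rank update in the base weights, at the cost of fresh quantization error at each merge (which the recomputed correction absorbs).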

Some layers (e.g. lm_head, embedding) should remain in BF16:

exclude_modules: [lm_head, embed_tokens]

AQN adds calibrated noise to the dequantized weights during training to regularize against quantization error. Enabled per-format:

enable_aqn: true
aqn_alpha: 1.0 # noise magnitude scale
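One way to picture AQN: perturb the dequantized weights by noise on the order of one quantization step, scaled by `aqn_alpha`, and only while training. This is a hedged sketch of the idea; the exact noise distribution and per-format calibration are implementation details not specified here:

```python
import numpy as np

def dequant_with_aqn(w_dequant, group_absmax, aqn_alpha=1.0, n_levels=16,
                     training=True, rng=None):
    """Add uniform noise of roughly one quantization step (illustrative only)."""
    if not training:
        return w_dequant                        # inference sees clean weights
    rng = rng or np.random.default_rng()
    step = group_absmax / (n_levels / 2)        # approximate quantization step size
    noise = rng.uniform(-0.5, 0.5, size=w_dequant.shape) * step * aqn_alpha
    return w_dequant + noise
```

Training against this jitter encourages the LoRA update to be robust to where the base weights actually sit inside their quantization bins.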

For MoE models, expert weights are quantized through QLoRAMoeExperts:

lora:
  enable_qlora: true
  quant_format: nvfp4
  lora_rank: 16
  # gate_proj/up_proj/down_proj targets fused expert weights
  lora_target_modules: [qkv_proj, gate_up_proj, down_proj, o_proj]

MoE expert loading uses maybe_load_and_quantize_moe_qlora() which handles EP-sharded expert tensors.

Approximate memory reduction versus BF16 full fine-tuning (per parameter):

| Config | Memory vs BF16 |
| --- | --- |
| BF16 + LoRA (rank 16) | ~100% base + tiny LoRA |
| NF4 + LoRA (rank 16) | ~25% base + tiny LoRA |
| NVFP4 + LoRA (rank 16) | ~12% base + tiny LoRA |

See examples/local/dummy/configs/qlora/ and examples/server/configs/qlora/ for complete QLoRA configs.

| File | Description |
| --- | --- |
| `src/xorl/ops/block_fp8.py` | Block FP8 quantization/dequantization ops |
| `src/xorl/ops/nf4.py` | NF4 quantization ops |
| `src/xorl/qlora/` | QLoRA module implementations |
| `src/xorl/qlora/detect_prequantized.py` | Auto-detect pre-quantized checkpoints |