
QLoRA

For QLoRA with MoE models, see MoE LoRA.

QLoRA stores base model weights in a quantized format and adds trainable LoRA adapters on top. This reduces the GPU memory footprint of the base weights by 2–8× while keeping LoRA parameters in full precision.

| Format | Bits/weight | Group size | Best for |
| --- | --- | --- | --- |
| `nvfp4` | 4-bit (FP4 E2M1) | 16 | Hopper (H100) with ModelOpt checkpoints |
| `block_fp8` | 8-bit (FP8 E4M3) | 128×128 block | Hopper, from HF FP8 checkpoints |
| `nf4` | 4-bit (NF4) | 64 | Any GPU, quantizes from BF16 at init |
[Diagram] The BF16 weight W is quantized into a packed weight (NF4 / NVFP4 / FP8) with a per-group scale tensor; both are frozen. Trainable BF16 LoRA adapters A [K, r] and B [r, M], plus an optional low-rank correction (U, B), sit alongside. Forward: dequant(W) + correction + LoRA_B @ LoRA_A × scale.
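The forward pass above can be sketched in NumPy. This is a minimal illustration, not xorl's implementation: shapes are hypothetical, the correction term is omitted, and the standard `lora_alpha / lora_rank` scaling is assumed for `scale`:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, r = 64, 32, 4            # in-features, out-features, LoRA rank (hypothetical)
alpha = 8.0                    # lora_alpha

W_dequant = rng.standard_normal((M, K))   # dequantized frozen base weight
A = rng.standard_normal((r, K)) * 0.01    # LoRA A, trainable BF16
B = np.zeros((M, r))                      # LoRA B, conventionally zero-initialized
x = rng.standard_normal(K)

# Forward: y = dequant(W) @ x + scale * LoRA_B @ (LoRA_A @ x)
y = W_dequant @ x + (alpha / r) * (B @ (A @ x))
```

With B initialized to zero, the adapter contributes nothing at step 0, so the quantized model's initial outputs match the base model's.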
lora:
  enable_qlora: true
  quant_format: nf4 # nf4 | nvfp4 | block_fp8
  lora_rank: 16
  lora_alpha: 32
  lora_target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj

NF4 quantizes BF16 weights on-the-fly during model initialization. No pre-quantized checkpoint required. Works on any GPU.

quant_format: nf4
quant_group_size: 64 # quantization block size
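NF4 rounds each weight to the nearest of 16 fixed code values (quantiles of a standard normal, from the QLoRA paper; approximated to four decimals here), after per-group absmax scaling. A NumPy sketch of the idea, not xorl's kernel:

```python
import numpy as np

# Approximate NF4 code values (normal quantiles, QLoRA paper)
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def nf4_quantize(w, group_size=64):
    """Per-group absmax scale, then snap each weight to the nearest NF4 level."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True)
    codes = np.argmin(
        np.abs(groups[:, :, None] / scales[:, :, None] - NF4_LEVELS), axis=-1
    )
    return codes, scales          # codes fit in 4 bits; one scale per group

def nf4_dequantize(codes, scales, shape):
    return (NF4_LEVELS[codes] * scales).reshape(shape)

w = np.random.default_rng(0).standard_normal(256)
codes, scales = nf4_quantize(w)
w_hat = nf4_dequantize(codes, scales, w.shape)   # w recovered to within one NF4 step
```

Because the levels are normal quantiles, the 16 codes are used roughly evenly for Gaussian-distributed weights, which is what makes NF4 information-theoretically efficient for neural network weights.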

NVFP4 uses NVIDIA’s FP4 E2M1 format with per-block scales and a global scale tensor. Requires either a pre-quantized checkpoint (NVIDIA ModelOpt format) or quantizes from BF16 at init.

quant_format: nvfp4
quant_group_size: 16
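FP4 E2M1 can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a sign bit. A simplified NumPy sketch of per-group quantization to that grid; real NVFP4 stores FP8 per-block scales plus a global FP32 scale, which this sketch collapses into plain float scales:

```python
import numpy as np

FP4_POS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_GRID = np.concatenate([-FP4_POS[:0:-1], FP4_POS])          # signed grid

def nvfp4_quantize(w, group_size=16):
    """Scale each group so its absmax maps to 6.0, then snap to the FP4 grid."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 6.0
    codes = np.argmin(
        np.abs(groups[:, :, None] / scales[:, :, None] - FP4_GRID), axis=-1
    )
    return codes, scales

def nvfp4_dequantize(codes, scales, shape):
    return (FP4_GRID[codes] * scales).reshape(shape)
```

The small group size (16 vs 64 for NF4) compensates for FP4's coarse grid: each group gets a scale tuned to just 16 weights, keeping outliers from blowing up the quantization error of their neighbors.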

Loading from a pre-quantized ModelOpt checkpoint:

model:
  model_path: /path/to/nvfp4_checkpoint

xorl auto-detects NVFP4 checkpoints and skips re-quantization.

block_fp8 applies block-wise FP8 E4M3 quantization. It is compatible with HF FP8 checkpoints (e.g., Qwen3-235B-A22B-Instruct-FP8).

quant_format: block_fp8
quant_group_size: 128
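Block-wise FP8 scales each 128×128 tile independently so it fits E4M3's range (max 448), then rounds to 3 mantissa bits. A NumPy simulation of the numerics (not xorl's `block_fp8` ops, which pack real FP8 tensors):

```python
import numpy as np

def round_e4m3(x):
    """Round to the nearest FP8 E4M3 value, saturating at +-448 (a sketch)."""
    sign = np.sign(x)
    a = np.minimum(np.abs(x), 448.0)
    out = np.zeros_like(a)
    nz = a > 0
    e = np.clip(np.floor(np.log2(a[nz])), -6, 8)   # exponent, incl. subnormal floor
    step = 2.0 ** (e - 3)                          # 3 mantissa bits
    out[nz] = np.minimum(np.round(a[nz] / step) * step, 448.0)
    return sign * out

def quantize_block_fp8(w, block=128):
    """One scale per (block x block) tile so each tile's absmax maps to 448."""
    R, C = w.shape
    wq = np.zeros_like(w)
    scales = np.zeros((R // block, C // block))
    for i in range(0, R, block):
        for j in range(0, C, block):
            blk = w[i:i + block, j:j + block]
            s = np.abs(blk).max() / 448.0
            scales[i // block, j // block] = s
            wq[i:i + block, j:j + block] = round_e4m3(blk / s)
    return wq, scales

def dequantize_block_fp8(wq, scales, block=128):
    return wq * np.kron(scales, np.ones((block, block)))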
| Parameter | Default | Description |
| --- | --- | --- |
| `enable_qlora` | false | Enable QLoRA |
| `quant_format` | nvfp4 | Quantization format: `nvfp4`, `block_fp8`, `nf4` |
| `quant_group_size` | format-dependent | Quantization block size |
| `lora_rank` | 16 | LoRA rank r |
| `lora_alpha` | 16 | LoRA scaling factor |
| `lora_target_modules` | null | Module names to inject QLoRA into |
| `save_lora_only` | false | Save only LoRA weights (not the packed base) |
| `exclude_modules` | null | Keep these modules in BF16 (e.g. lm_head) |
| `merge_lora_interval` | 0 | Merge LoRA into base every N steps (0 = off) |
| `reset_optimizer_on_merge` | false | Reset optimizer states on merge (ReLoRA) |
| `enable_aqn` | false | Adaptive Quantization Noise |
| `aqn_alpha` | 1.0 | AQN noise scale |

Periodically merge the LoRA delta back into the quantized base weights, then restart LoRA from scratch. After merging, the base weight stores quant(W + lora_delta) and the correction is recomputed.

merge_lora_interval: 200 # merge every 200 steps
reset_optimizer_on_merge: true # clear Adam momentum/variance (ReLoRA)
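The merge-and-restart cycle can be sketched as follows. The quantizer here is a toy symmetric absmax round (standing in for NF4/NVFP4), and the names are illustrative, not xorl internals:

```python
import numpy as np

def quant(w, levels=15):
    """Toy symmetric absmax quantizer standing in for NF4/NVFP4 (a sketch)."""
    s = np.abs(w).max() / (levels // 2)
    return np.round(w / s), s

def dequant(codes, s):
    return codes * s

rng = np.random.default_rng(0)
M, K, r, alpha = 8, 8, 2, 4.0
W_codes, W_scale = quant(rng.standard_normal((M, K)))
A = rng.standard_normal((r, K)) * 0.1   # pretend these were trained for N steps
B = rng.standard_normal((M, r)) * 0.1
adam_state = {}                         # stand-in for optimizer moments

# Every merge_lora_interval steps: fold the LoRA delta into the quantized base...
delta = (alpha / r) * (B @ A)
W_codes, W_scale = quant(dequant(W_codes, W_scale) + delta)

# ...then restart LoRA from scratch
A = rng.standard_normal((r, K)) * 0.01
B = np.zeros((M, r))
adam_state = {}                         # reset_optimizer_on_merge (ReLoRA)
```

Repeated merges let a rank-r adapter accumulate an effectively higher-rank update in the base weights, at the cost of fresh quantization error at each merge (which the recomputed correction absorbs).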

Some layers (e.g. lm_head, embedding) should remain in BF16:

exclude_modules: [lm_head, embed_tokens]

AQN adds calibrated noise to the dequantized weights during training to regularize against quantization error. Enabled per-format:

enable_aqn: true
aqn_alpha: 1.0 # noise magnitude scale
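One way to picture AQN: perturb the dequantized weights by noise on the order of one quantization step, scaled by `aqn_alpha`, and only while training. This is a hedged sketch of the idea; the exact noise distribution and per-format calibration are implementation details not specified here:

```python
import numpy as np

def dequant_with_aqn(w_dequant, group_absmax, aqn_alpha=1.0, n_levels=16,
                     training=True, rng=None):
    """Add uniform noise of roughly one quantization step (illustrative only)."""
    if not training:
        return w_dequant                        # inference sees clean weights
    rng = rng or np.random.default_rng()
    step = group_absmax / (n_levels / 2)        # approximate quantization step size
    noise = rng.uniform(-0.5, 0.5, size=w_dequant.shape) * step * aqn_alpha
    return w_dequant + noise
```

Training against this jitter encourages the LoRA update to be robust to where the base weights actually sit inside their quantization bins.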

For MoE models, expert weights are quantized through QLoRAMoeExperts:

lora:
  enable_qlora: true
  quant_format: nvfp4
  lora_rank: 16
  # gate_proj/up_proj/down_proj targets fused expert weights
  lora_target_modules: [qkv_proj, gate_up_proj, down_proj, o_proj]

MoE expert loading uses maybe_load_and_quantize_moe_qlora() which handles EP-sharded expert tensors.

Approximate memory reduction versus BF16 full fine-tuning (per parameter):

| Config | Memory vs BF16 |
| --- | --- |
| BF16 + LoRA (rank 16) | ~100% base + tiny LoRA |
| NF4 + LoRA (rank 16) | ~25% base + tiny LoRA |
| NVFP4 + LoRA (rank 16) | ~12% base + tiny LoRA |

See examples/local/dummy/configs/qlora/ and examples/server/configs/qlora/ for complete QLoRA configs.

| File | Description |
| --- | --- |
| `src/xorl/ops/block_fp8.py` | Block FP8 quantization/dequantization ops |
| `src/xorl/ops/nf4.py` | NF4 quantization ops |
| `src/xorl/qlora/` | QLoRA module implementations |
| `src/xorl/qlora/detect_prequantized.py` | Auto-detect pre-quantized checkpoints |