
LoRA

For MoE-specific LoRA (expert weight layout, EP sharding), see MoE LoRA.

LoRA (Low-Rank Adaptation) freezes the base model weights and adds trainable low-rank matrices to selected linear layers. xorl implements FSDP2-compatible LoRA for both local and server training.

For a linear layer with weight matrix W₀ ∈ ℝ^(d×k), LoRA adds a trainable bypass:

W = W₀ + B · A where A ∈ ℝ^(r×k), B ∈ ℝ^(d×r), and r ≪ min(d, k)

W₀ is frozen. Only A and B are updated during training. The output scaling is lora_alpha / r.
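As a minimal sketch of this decomposition (a simplified stand-in, not xorl's actual `LoraLinear` implementation), a wrapped linear layer computes the frozen base output plus the scaled low-rank bypass:

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W0·x + (alpha / r) · B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 16, lora_alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze W0 (and bias, if any)
            p.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # A: r x k
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # B: d x r
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)  # B = 0 -> ΔW = 0 at step 0
        self.scaling = lora_alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```

At initialization B = 0, so the wrapped layer reproduces the base layer's output exactly.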

*Figure: LoRA weight decomposition. The output is W·x with W = W₀ + B·A: W₀ (d×k) is frozen with no gradient, A (r×k) is Kaiming-initialized, B (d×r) is zero-initialized, and the bypass is scaled by α/r = lora_alpha/rank. Because B starts at zero, ΔW = 0 at the start of training.*

A is initialized with Kaiming uniform, B with zeros — so the initial LoRA output is exactly zero and training starts from the pretrained model’s behavior.

Add a lora section to your config:

```yaml
lora:
  enable_lora: true
  lora_rank: 16
  lora_alpha: 32                # scaling = lora_alpha / lora_rank = 2.0
  lora_target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
  save_lora_only: true          # checkpoint only LoRA weights, not the base model
```

For QKV-fused models (default merge_qkv: true), target the fused projection names:

```yaml
lora_target_modules: [qkv_proj, gate_up_proj, down_proj, o_proj]
```
| Parameter | Default | Description |
| --- | --- | --- |
| `enable_lora` | `false` | Enable LoRA injection |
| `lora_rank` | `16` | Rank r of the low-rank decomposition |
| `lora_alpha` | `32` | Scaling factor; output scale = alpha/rank |
| `lora_target_modules` | `null` | List of module name patterns to inject LoRA into |
| `save_lora_only` | `false` | Save only LoRA weights in checkpoints |
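To illustrate how `lora_target_modules` patterns select layers, here is a small sketch (the helper name `find_lora_targets` is illustrative, not part of xorl's API): it walks the module tree and matches each `nn.Linear` whose final path segment is one of the target names.

```python
import torch.nn as nn


def find_lora_targets(model: nn.Module, target_modules: list[str]) -> list[str]:
    """Qualified names of nn.Linear layers whose last path segment matches a target."""
    return [
        name
        for name, mod in model.named_modules()
        if isinstance(mod, nn.Linear)
        and any(name.split(".")[-1] == pattern for pattern in target_modules)
    ]


# Toy model with transformer-style projection names
block = nn.ModuleDict({
    "q_proj": nn.Linear(8, 8),
    "k_proj": nn.Linear(8, 8),
    "mlp_out": nn.Linear(8, 8),
})
print(find_lora_targets(block, ["q_proj", "k_proj"]))  # → ['q_proj', 'k_proj']
```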

For MoE models, you can target expert layers as well. Expert LoRA uses fused group GEMM for efficiency:

```yaml
lora:
  enable_lora: true
  lora_rank: 16
  lora_alpha: 32
  lora_target_modules: [qkv_proj, gate_up_proj, down_proj, o_proj]
  # Expert layers are included automatically when targeting gate_up_proj/down_proj
```
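At the shape level, fused expert LoRA can be sketched as batched matrix multiplies over the expert dimension (a simplification that assumes tokens are already routed and grouped per expert; xorl's actual group GEMM kernel differs):

```python
import torch

E, H, I, r = 4, 16, 32, 2        # experts, hidden, intermediate, rank (toy sizes)
tokens_per_expert = 5

W0 = torch.randn(E, I, H)        # fused expert weights [E, I, H], frozen
A = 0.01 * torch.randn(E, r, H)  # per-expert LoRA A
B = torch.zeros(E, I, r)         # per-expert LoRA B (zeros at init)
scaling = 2.0                    # lora_alpha / lora_rank

x = torch.randn(E, tokens_per_expert, H)  # tokens grouped by expert

# One batched ("group GEMM"-style) pass: base output + scaled low-rank bypass
base = torch.bmm(x, W0.transpose(1, 2))             # [E, T, I]
lora = torch.bmm(torch.bmm(x, A.transpose(1, 2)),   # [E, T, r]
                 B.transpose(1, 2))                 # [E, T, I]
out = base + scaling * lora
```

With B at zero, the expert outputs are identical to the frozen base computation.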

With save_lora_only: true, only the LoRA adapter weights are saved. The base model must be available separately to reconstruct the full model.

Checkpoint structure:

```
outputs/my_run/weights/{run_id}/step_{N}/
├── adapter_config.json
├── adapter_model.safetensors   # LoRA weights only
└── training_state/             # optimizer, scheduler, RNG state
```

Compatible with the HF PEFT adapter format for easy loading:

```python
from peft import PeftModel

model = PeftModel.from_pretrained(base_model, "outputs/my_run/weights/.../step_42")
```

For long training runs, periodically merge LoRA into the base weights and restart with fresh LoRA parameters. This avoids rank saturation and allows the effective rank to grow over training.
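The merge step folds the scaled bypass into the base weight and resets the adapter, as in this sketch (a simplified illustration operating on raw tensors, not xorl's checkpoint-level merge):

```python
import math

import torch


@torch.no_grad()
def merge_and_reinit(W0: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                     alpha: int, r: int) -> None:
    """Fold the adapter into the base weight, then restart from a fresh adapter."""
    W0 += (alpha / r) * (B @ A)                         # merge: W0 <- W0 + (α/r)·B·A
    torch.nn.init.kaiming_uniform_(A, a=math.sqrt(5))   # fresh A (Kaiming)
    B.zero_()                                           # fresh B = 0 -> no behavior change
```

Immediately after the merge the model's function is unchanged (the new B is zero), but the next training phase can learn a new rank-r update on top of the already-absorbed one.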

For QLoRA, see QLoRA — Periodic Merge.

In server training, LoRA is configured in the server YAML:

```yaml
enable_lora: true
lora_rank: 16
lora_alpha: 32
```

The server supports multiple named adapters for multi-task or multi-policy training. Specify a model_id per request to route to a specific adapter:

```json
{
  "batches": [...],
  "loss_fn": "causallm_loss",
  "model_id": "policy_v1"
}
```

Save a specific adapter:

```
POST /api/v1/save_weights
{"path": "outputs/adapters/policy_v1", "model_id": "policy_v1"}
```
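These request bodies can be assembled programmatically; a minimal sketch using only the fields shown on this page (the server address in the comment is an assumption):

```python
import json

# Route training and checkpointing to the "policy_v1" adapter via model_id.
train_request = {
    "batches": [],                  # training batches go here
    "loss_fn": "causallm_loss",
    "model_id": "policy_v1",
}
save_request = {
    "path": "outputs/adapters/policy_v1",
    "model_id": "policy_v1",
}

# To send them, e.g. with requests (server address is an assumption):
#   requests.post("http://localhost:8000/api/v1/save_weights", json=save_request)
print(json.dumps(save_request))
```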

See examples/local/dummy/configs/lora/ and examples/server/configs/lora/ for complete LoRA configs across model sizes.

| File | Description |
| --- | --- |
| `src/xorl/lora/modules/base.py` | `LoraModule` abstract base — defines `r`, `lora_alpha`, `scaling`, and the `from_module()` factory |
| `src/xorl/lora/modules/linear.py` | `LoraLinear` — LoRA for `nn.Linear`; A (Kaiming), B (zeros), forward with scaling |
| `src/xorl/lora/mapping.py` | `LORA_MAPPING` registry and `get_lora_class_for_module()` lookup |
| `src/xorl/models/layers/moe/lora.py` | `MoEExpertsLoRA` — LoRA for fused expert tensors `[E, I, H]` |