
LoRA

For MoE-specific LoRA (expert weight layout, EP sharding), see MoE LoRA.

LoRA (Low-Rank Adaptation) freezes the base model weights and adds trainable low-rank matrices to selected linear layers. xorl implements FSDP2-compatible LoRA for both local and server training.

For a linear layer with weight matrix W₀ ∈ ℝ^(d×k), LoRA adds a trainable bypass:

W = W₀ + B · A where A ∈ ℝ^(r×k), B ∈ ℝ^(d×r), and r ≪ min(d, k)

W₀ is frozen. Only A and B are updated during training. The output scaling is lora_alpha / r.
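As a minimal sketch of this decomposition (a simplified stand-in, not xorl's actual `LoraLinear` implementation), a wrapped linear layer computes the frozen base output plus the scaled low-rank bypass:

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W0·x + (alpha / r) · B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 16, lora_alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze W0 (and bias, if any)
            p.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # A: r x k
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # B: d x r
        nn.init.kaiming_uniform_(self.lora_A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B.weight)  # B = 0 -> ΔW = 0 at step 0
        self.scaling = lora_alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))
```

At initialization B = 0, so the wrapped layer reproduces the base layer's output exactly.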

*Figure: LoRA weight decomposition. The output is W·x with W = W₀ + B·A: W₀ (d×k) is frozen with no gradient, A (r×k) is Kaiming-initialized, B (d×r) is zero-initialized, and the bypass is scaled by α/r = lora_alpha/rank. Because B starts at zero, ΔW = 0 at the start of training.*

A is initialized with Kaiming uniform, B with zeros — so the initial LoRA output is exactly zero and training starts from the pretrained model’s behavior.

Add a lora section to your config:

```yaml
lora:
  enable_lora: true
  lora_rank: 16
  lora_alpha: 32                # scaling = lora_alpha / lora_rank = 2.0
  lora_target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
  save_lora_only: true          # checkpoint only LoRA weights, not the base model
```

For QKV-fused models (default merge_qkv: true), target the fused projection names:

```yaml
lora_target_modules: [qkv_proj, gate_up_proj, down_proj, o_proj]
```
| Parameter | Default | Description |
| --- | --- | --- |
| `enable_lora` | `false` | Enable LoRA injection |
| `lora_rank` | `16` | Rank r of the low-rank decomposition |
| `lora_alpha` | `32` | Scaling factor; output scale = alpha/rank |
| `lora_target_modules` | `null` | List of module name patterns to inject LoRA into |
| `save_lora_only` | `false` | Save only LoRA weights in checkpoints |
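To illustrate how `lora_target_modules` patterns select layers, here is a small sketch (the helper name `find_lora_targets` is illustrative, not part of xorl's API): it walks the module tree and matches each `nn.Linear` whose final path segment is one of the target names.

```python
import torch.nn as nn


def find_lora_targets(model: nn.Module, target_modules: list[str]) -> list[str]:
    """Qualified names of nn.Linear layers whose last path segment matches a target."""
    return [
        name
        for name, mod in model.named_modules()
        if isinstance(mod, nn.Linear)
        and any(name.split(".")[-1] == pattern for pattern in target_modules)
    ]


# Toy model with transformer-style projection names
block = nn.ModuleDict({
    "q_proj": nn.Linear(8, 8),
    "k_proj": nn.Linear(8, 8),
    "mlp_out": nn.Linear(8, 8),
})
print(find_lora_targets(block, ["q_proj", "k_proj"]))  # → ['q_proj', 'k_proj']
```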

For MoE models, you can target expert layers as well. Expert LoRA uses fused group GEMM for efficiency:

```yaml
lora:
  enable_lora: true
  lora_rank: 16
  lora_alpha: 32
  lora_target_modules: [qkv_proj, gate_up_proj, down_proj, o_proj]
  # Expert layers are included automatically when targeting gate_up_proj/down_proj
```
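At the shape level, fused expert LoRA can be sketched as batched matrix multiplies over the expert dimension (a simplification that assumes tokens are already routed and grouped per expert; xorl's actual group GEMM kernel differs):

```python
import torch

E, H, I, r = 4, 16, 32, 2        # experts, hidden, intermediate, rank (toy sizes)
tokens_per_expert = 5

W0 = torch.randn(E, I, H)        # fused expert weights [E, I, H], frozen
A = 0.01 * torch.randn(E, r, H)  # per-expert LoRA A
B = torch.zeros(E, I, r)         # per-expert LoRA B (zeros at init)
scaling = 2.0                    # lora_alpha / lora_rank

x = torch.randn(E, tokens_per_expert, H)  # tokens grouped by expert

# One batched ("group GEMM"-style) pass: base output + scaled low-rank bypass
base = torch.bmm(x, W0.transpose(1, 2))             # [E, T, I]
lora = torch.bmm(torch.bmm(x, A.transpose(1, 2)),   # [E, T, r]
                 B.transpose(1, 2))                 # [E, T, I]
out = base + scaling * lora
```

With B at zero, the expert outputs are identical to the frozen base computation.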

With save_lora_only: true, only the LoRA adapter weights are saved. The base model must be available separately to reconstruct the full model.

Checkpoint structure:

```
outputs/my_run/weights/{run_id}/step_{N}/
├── adapter_config.json
├── adapter_model.safetensors   # LoRA weights only
└── training_state/             # optimizer, scheduler, RNG state
```

Compatible with the HF PEFT adapter format for easy loading:

```python
from peft import PeftModel

model = PeftModel.from_pretrained(base_model, "outputs/my_run/weights/.../step_42")
```

For long training runs, periodically merge LoRA into the base weights and restart with fresh LoRA parameters. This avoids rank saturation and allows the effective rank to grow over training.
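The merge step folds the scaled bypass into the base weight and resets the adapter, as in this sketch (a simplified illustration operating on raw tensors, not xorl's checkpoint-level merge):

```python
import math

import torch


@torch.no_grad()
def merge_and_reinit(W0: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
                     alpha: int, r: int) -> None:
    """Fold the adapter into the base weight, then restart from a fresh adapter."""
    W0 += (alpha / r) * (B @ A)                         # merge: W0 <- W0 + (α/r)·B·A
    torch.nn.init.kaiming_uniform_(A, a=math.sqrt(5))   # fresh A (Kaiming)
    B.zero_()                                           # fresh B = 0 -> no behavior change
```

Immediately after the merge the model's function is unchanged (the new B is zero), but the next training phase can learn a new rank-r update on top of the already-absorbed one.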

For QLoRA, see QLoRA — Periodic Merge.

In server training, LoRA is configured in the server YAML:

```yaml
enable_lora: true
lora_rank: 16
lora_alpha: 32
```

The server supports multiple named adapters for multi-task or multi-policy training. Specify a model_id per request to route to a specific adapter:

```json
{
  "batches": [...],
  "loss_fn": "causallm_loss",
  "model_id": "policy_v1"
}
```

Save a specific adapter:

```
POST /api/v1/save_weights
{"path": "outputs/adapters/policy_v1", "model_id": "policy_v1"}
```
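These request bodies can be assembled programmatically; a minimal sketch using only the fields shown on this page (the server address in the comment is an assumption):

```python
import json

# Route training and checkpointing to the "policy_v1" adapter via model_id.
train_request = {
    "batches": [],                  # training batches go here
    "loss_fn": "causallm_loss",
    "model_id": "policy_v1",
}
save_request = {
    "path": "outputs/adapters/policy_v1",
    "model_id": "policy_v1",
}

# To send them, e.g. with requests (server address is an assumption):
#   requests.post("http://localhost:8000/api/v1/save_weights", json=save_request)
print(json.dumps(save_request))
```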

See examples/local/dummy/configs/lora/ and examples/server/configs/lora/ for complete LoRA configs across model sizes.

| File | Description |
| --- | --- |
| `src/xorl/lora/modules/base.py` | `LoraModule` abstract base — defines `r`, `lora_alpha`, `scaling`, and the `from_module()` factory |
| `src/xorl/lora/modules/linear.py` | `LoraLinear` — LoRA for `nn.Linear`; A (Kaiming), B (zeros), forward with scaling |
| `src/xorl/lora/mapping.py` | `LORA_MAPPING` registry and `get_lora_class_for_module()` lookup |
| `src/xorl/models/layers/moe/lora.py` | `MoEExpertsLoRA` — LoRA for fused expert tensors `[E, I, H]` |