# LoRA
For MoE-specific LoRA (expert weight layout, EP sharding), see MoE LoRA.
LoRA (Low-Rank Adaptation) freezes the base model weights and adds trainable low-rank matrices to selected linear layers. xorl implements FSDP2-compatible LoRA for both local and server training.
## How LoRA works

For a linear layer with weight matrix W₀ ∈ ℝ^(d×k), LoRA adds a trainable bypass:
W = W₀ + B · A where A ∈ ℝ^(r×k), B ∈ ℝ^(d×r), and r ≪ min(d, k)
W₀ is frozen. Only A and B are updated during training. The output scaling is lora_alpha / r.
A is initialized with Kaiming uniform, B with zeros — so the initial LoRA output is exactly zero and training starts from the pretrained model’s behavior.
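As a concrete sketch of the scheme above (the class name and structure here are illustrative; xorl's actual `LoraLinear` lives in `src/xorl/lora/modules/linear.py` and differs in detail):

```python
import math
import torch
import torch.nn as nn

class LoraLinearSketch(nn.Module):
    """Illustrative LoRA wrapper: frozen W0 plus a scaled low-rank bypass."""

    def __init__(self, base: nn.Linear, r: int = 16, lora_alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze W0
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.empty(r, base.in_features))   # A in R^(r x k)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # B in R^(d x r), zeros
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))          # Kaiming uniform init
        self.scaling = lora_alpha / r                                  # output scaling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W0 x + (alpha/r) * B A x; identical to the base layer while B = 0
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

base = nn.Linear(64, 32)
lora = LoraLinearSketch(base, r=8, lora_alpha=16)
x = torch.randn(4, 64)
assert torch.allclose(lora(x), base(x))  # B = 0, so the LoRA delta is exactly zero
```

Because B starts at zero, the wrapped layer reproduces the pretrained layer exactly on step zero; only `lora_A` and `lora_B` receive gradients.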
## Basic Configuration

Add a `lora` section to your config:

```yaml
lora:
  enable_lora: true
  lora_rank: 16
  lora_alpha: 32            # scaling = lora_alpha / lora_rank = 2.0
  lora_target_modules:
    - q_proj
    - k_proj
    - v_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
  save_lora_only: true      # checkpoint only LoRA weights, not base model
```

For QKV-fused models (default `merge_qkv: true`), target the fused projection names:

```yaml
lora_target_modules: [qkv_proj, gate_up_proj, down_proj, o_proj]
```

## Key Parameters
| Parameter | Default | Description |
|---|---|---|
| `enable_lora` | `false` | Enable LoRA injection |
| `lora_rank` | `16` | Rank `r` of the low-rank decomposition |
| `lora_alpha` | `32` | Scaling factor; output scaling = `lora_alpha / lora_rank` |
| `lora_target_modules` | `null` | List of module name patterns to inject LoRA into |
| `save_lora_only` | `false` | Save only LoRA weights in checkpoints |
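To show what matching on `lora_target_modules` patterns means in practice, here is a hedged sketch; the `TinyBlock` model and `matching_modules` helper are made up for illustration (xorl's real lookup lives in `src/xorl/lora/mapping.py`):

```python
import torch.nn as nn

class TinyBlock(nn.Module):
    """Toy transformer-ish block with named projections."""
    def __init__(self):
        super().__init__()
        self.q_proj = nn.Linear(8, 8)
        self.k_proj = nn.Linear(8, 8)
        self.mlp_gate = nn.Linear(8, 16)

def matching_modules(model: nn.Module, target_modules: list[str]) -> list[str]:
    # A target pattern matches a linear layer whose qualified name ends with it
    return [
        name for name, m in model.named_modules()
        if isinstance(m, nn.Linear)
        and any(name == t or name.endswith("." + t) for t in target_modules)
    ]

model = nn.Sequential(TinyBlock(), TinyBlock())
print(matching_modules(model, ["q_proj", "k_proj"]))
# → ['0.q_proj', '0.k_proj', '1.q_proj', '1.k_proj']
```

Each matched `nn.Linear` is then replaced by its LoRA counterpart; unmatched layers (here `mlp_gate`) stay frozen base weights with no adapter.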
## LoRA with MoE Models

For MoE models, you can target expert layers as well. Expert LoRA uses fused group GEMM for efficiency:

```yaml
lora:
  enable_lora: true
  lora_rank: 16
  lora_alpha: 32
  lora_target_modules: [qkv_proj, gate_up_proj, down_proj, o_proj]
  # Expert layers are included automatically when targeting gate_up_proj/down_proj
```
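The fused expert layout can be sketched with a batched matmul standing in for the group GEMM. This is an illustration of the `[E, I, H]` shape convention, not `MoEExpertsLoRA` itself:

```python
import torch

E, I, H, r, T = 4, 32, 16, 8, 10        # experts, intermediate, hidden, rank, tokens per expert
W0 = torch.randn(E, I, H)               # frozen fused expert weights [E, I, H]
A = torch.randn(E, r, H) * 0.01         # per-expert LoRA A
B = torch.zeros(E, I, r)                # per-expert LoRA B (zeros at init)
scaling = 32 / r                        # lora_alpha / lora_rank

x = torch.randn(E, T, H)                # tokens already grouped by expert
base = torch.einsum("eih,eth->eti", W0, x)                    # per-expert base GEMM
delta = torch.einsum("eir,erh,eth->eti", B, A, x) * scaling   # per-expert LoRA bypass
out = base + delta
assert torch.allclose(out, base)        # B = 0, so the delta vanishes at init
```

All experts' adapters live in two fused tensors (`A` and `B`), so the bypass is computed in one batched call rather than one small GEMM per expert.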
## Checkpoint Behavior

With `save_lora_only: true`, only the LoRA adapter weights are saved. The base model must be available separately to reconstruct the full model.
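The effect of `save_lora_only` can be sketched as a state-dict filter; the key names below are illustrative, assuming a PEFT-style layout where adapter parameters carry a `lora_` marker:

```python
import torch

state_dict = {
    "model.layers.0.q_proj.weight":        torch.zeros(32, 32),  # frozen base: skipped
    "model.layers.0.q_proj.lora_A.weight": torch.zeros(8, 32),   # adapter: saved
    "model.layers.0.q_proj.lora_B.weight": torch.zeros(32, 8),   # adapter: saved
}
adapter_state = {k: v for k, v in state_dict.items() if ".lora_" in k}
# torch.save(adapter_state, "adapter_model.bin")  # or safetensors, as in the real layout
assert len(adapter_state) == 2
```

The dropped base entries are exactly why the base model must be available at load time.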
Checkpoint structure:
```
outputs/my_run/weights/{run_id}/step_{N}/
├── adapter_config.json
├── adapter_model.safetensors   # LoRA weights only
└── training_state/             # optimizer, scheduler, rng state
```

Compatible with the HF PEFT adapter format for easy loading:

```python
from peft import PeftModel

model = PeftModel.from_pretrained(base_model, "outputs/my_run/weights/.../step_42")
```

## Periodic Merge (ReLoRA)
For long training runs, periodically merge LoRA into the base weights and restart with fresh LoRA parameters. This avoids rank saturation and allows the effective rank to grow over training.
For QLoRA, see QLoRA — Periodic Merge.
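One merge-and-reset step can be sketched as follows, with plain tensors standing in for module parameters (illustrative only):

```python
import math
import torch
import torch.nn as nn

d, k, r, alpha = 16, 16, 4, 8
W0 = torch.randn(d, k)                              # base weight
A = torch.empty(r, k)
nn.init.kaiming_uniform_(A, a=math.sqrt(5))
B = torch.randn(d, r)                               # pretend B was trained away from zero
scaling = alpha / r

W_effective = W0 + scaling * (B @ A)                # function computed before the merge

# Merge the adapter into the base weight, then reset A (Kaiming) and B (zeros)
W0 = W0 + scaling * (B @ A)
nn.init.kaiming_uniform_(A, a=math.sqrt(5))
B.zero_()

# After the reset the model computes the same function it did before the merge,
# but the next LoRA update is free to explore a fresh rank-r subspace.
assert torch.allclose(W0 + scaling * (B @ A), W_effective)
```

Because B returns to zero, the merge is loss-neutral at the step boundary; across many cycles the accumulated update in W₀ can exceed rank r even though each adapter is rank-r.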
## LoRA in Server Training

In server training, LoRA is configured in the server YAML:

```yaml
enable_lora: true
lora_rank: 16
lora_alpha: 32
```

The server supports multiple named adapters for multi-task or multi-policy training. Specify a `model_id` per request to route to a specific adapter:
```json
{
  "batches": [...],
  "loss_fn": "causallm_loss",
  "model_id": "policy_v1"
}
```

Save a specific adapter:

```
POST /api/v1/save_weights
{"path": "outputs/adapters/policy_v1", "model_id": "policy_v1"}
```
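A minimal client-side sketch of adapter routing; the payload helpers and `server_url` are assumptions, and only the request bodies mirror the snippets above:

```python
import json

def train_payload(batches: list, model_id: str) -> dict:
    # model_id routes the batch to a specific named adapter on the server
    return {"batches": batches, "loss_fn": "causallm_loss", "model_id": model_id}

def save_payload(path: str, model_id: str) -> dict:
    return {"path": path, "model_id": model_id}

body = json.dumps(train_payload([{"input_ids": [1, 2, 3]}], "policy_v1"))
# Hypothetical client call, e.g.:
# requests.post(f"{server_url}/api/v1/save_weights",
#               data=json.dumps(save_payload("outputs/adapters/policy_v1", "policy_v1")))
assert json.loads(body)["model_id"] == "policy_v1"
```

Omitting `model_id` would leave adapter selection to the server's default; with multiple named adapters, include it on every request that must hit a specific policy.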
## Example Configs

See `examples/local/dummy/configs/lora/` and `examples/server/configs/lora/` for complete LoRA configs across model sizes.
## Source

| File | Description |
|---|---|
| `src/xorl/lora/modules/base.py` | `LoraModule` abstract base — defines `r`, `lora_alpha`, `scaling`, and the `from_module()` factory |
| `src/xorl/lora/modules/linear.py` | `LoraLinear` — LoRA for `nn.Linear`; A (Kaiming), B (zeros), forward with scaling |
| `src/xorl/lora/mapping.py` | `LORA_MAPPING` registry and `get_lora_class_for_module()` lookup |
| `src/xorl/models/layers/moe/lora.py` | `MoEExpertsLoRA` — LoRA for fused expert tensors `[E, I, H]` |