xorl is a distributed training framework for large language models built for flexibility — composable parallelism, LoRA and QLoRA fine-tuning, MoE, and both local and server training modes for online RL loops.
Composable parallelism
Data (FSDP2), Tensor, Pipeline, Expert, Ulysses sequence, and Ring Attention — all composable across any combination of dimensions.
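The invariant behind composing these dimensions is that the global rank count factors into a mesh, one axis per parallelism kind. A minimal sketch (the dict keys and layout here are illustrative, not xorl's actual config schema):

```python
# Hypothetical parallelism layout -- names are illustrative, not xorl's real API.
# The world size must equal the product of all parallel dimensions.
layout = {
    "data": 4,        # FSDP2 sharding groups
    "tensor": 2,      # tensor-parallel ranks per group
    "pipeline": 2,    # pipeline stages
    "expert": 1,      # expert-parallel groups (MoE)
    "sequence": 1,    # Ulysses / Ring Attention sequence-parallel degree
}

def required_world_size(layout):
    n = 1
    for dim in layout.values():
        n *= dim
    return n

print(required_world_size(layout))  # 16 GPUs for this mesh
```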
Local and server training
Run directly with torchrun for offline training, or use the REST API server for online RL loops with live inference engines.
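A launch might look roughly like the following; the module paths and flags after `torchrun`'s own options are placeholders, so consult the xorl documentation for the real entrypoints:

```shell
# Offline training: torchrun spawns one process per GPU.
# "xorl.train" and "--config" are hypothetical names, not confirmed xorl CLI.
torchrun --nproc-per-node=8 -m xorl.train --config train.yaml

# Server mode for online RL: a REST API that inference engines call into.
# "xorl.serve" is likewise a placeholder module name.
torchrun --nproc-per-node=8 -m xorl.serve --config train.yaml
```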
LoRA and QLoRA
Full LoRA support with QLoRA quantization in NVFP4, Block-FP8, and NF4 formats. Adaptive quantization noise and error correction built in.
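The core of LoRA, independent of xorl's implementation, is a frozen base weight plus a low-rank trainable correction. A minimal NumPy sketch under standard LoRA conventions (zero-initialized up-projection, `alpha/r` scaling):

```python
import numpy as np

# LoRA forward sketch (not xorl internals): y = x @ W + (alpha / r) * x @ A @ B
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 8, 4, 8

W = rng.normal(size=(d_in, d_out))      # frozen base weight (quantized in QLoRA)
A = rng.normal(size=(d_in, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d_out))                # trainable up-projection, zero-initialized

def lora_forward(x):
    return x @ W + (alpha / r) * (x @ A) @ B

x = rng.normal(size=(2, d_in))
# With B at zero, the adapter is a no-op: output matches the base model exactly.
assert np.allclose(lora_forward(x), x @ W)
```

Zero-initializing `B` means training starts from the base model's behavior; in QLoRA, `W` would additionally be stored in a quantized format such as NF4 and dequantized on the fly.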
Mixture of Experts
Fused expert kernels via Triton and Quack, Expert Parallelism with AllToAll and DeepEP (NVLink-optimized), plus routing cache and replay.
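Before any fused kernel runs, an MoE router picks the top-k experts per token and normalizes gate weights over that subset. A CPU-side sketch of that routing step (illustrative only; xorl's Triton/Quack kernels do this fused on-GPU):

```python
import numpy as np

# Top-k token routing sketch: each token is sent to its k highest-scoring experts.
rng = np.random.default_rng(1)
tokens, experts, k = 6, 4, 2

logits = rng.normal(size=(tokens, experts))   # router scores per token
topk = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k best experts
gates = np.take_along_axis(logits, topk, axis=-1)
gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # softmax over chosen experts

assert topk.shape == (tokens, k)
assert np.allclose(gates.sum(-1), 1.0)        # per-token gate weights sum to 1
```

Caching these `(topk, gates)` decisions is what makes routing replay possible: a later pass can reuse the recorded assignment instead of re-running the router.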
Weight sync
NCCL broadcast from training ranks to SGLang inference endpoints after each step, enabling tight online RL integration.
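The flow is: after each optimizer step, the training rank holding fresh weights broadcasts them to every inference endpoint, which would otherwise keep serving stale parameters. A toy single-process simulation of that flow (plain Python stand-ins, not actual NCCL or SGLang calls):

```python
# Toy simulation of post-step weight sync; real xorl uses dist.broadcast over
# an NCCL group spanning the training ranks and the SGLang endpoints.
train_rank = {"w": [0.1, 0.2, 0.3]}   # source of truth after the optimizer step
inference_ranks = [{"w": [0.0, 0.0, 0.0]} for _ in range(2)]  # stale endpoints

def broadcast(src, dsts):
    # Stand-in for an NCCL broadcast from the source rank to all receivers.
    for d in dsts:
        d["w"] = list(src["w"])

broadcast(train_rank, inference_ranks)
assert all(r["w"] == train_rank["w"] for r in inference_ranks)
```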
Muon optimizer
Newton-Schulz orthogonalized gradient descent for 2D+ weight matrices, with Nesterov momentum and configurable LR scaling.
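Newton-Schulz iteration replaces a gradient matrix with an approximately orthogonal matrix of the same shape, equalizing its singular values before the update. The sketch below uses the classical cubic iteration for clarity; Muon in practice uses a tuned quintic variant, and this is not xorl's actual kernel:

```python
import numpy as np

# Cubic Newton-Schulz: drive every singular value of X toward 1,
# converging to the polar factor U @ V.T of the input matrix.
def newton_schulz_orthogonalize(G, steps=25):
    X = G / np.linalg.norm(G)            # Frobenius norm => singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # each step maps s -> 1.5*s - 0.5*s**3
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))              # stands in for a 2D weight's gradient
O = newton_schulz_orthogonalize(G)
assert np.allclose(O.T @ O, np.eye(3), atol=1e-3)  # columns are orthonormal
```

Using only matrix multiplies (no SVD) is what makes this practical on GPU, and it is why Muon restricts the treatment to 2D+ weight matrices.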