xorl is a distributed training framework for large language models built for flexibility — composable parallelism, LoRA and QLoRA fine-tuning, MoE, and both local and server training modes for online RL loops.
Composable parallelism
Data (FSDP2), Tensor, Pipeline, Expert, Ulysses sequence, and Ring Attention — all composable across any combination of dimensions.
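The invariant behind composing these dimensions is that the global rank count factors into a mesh, one axis per parallelism kind. A minimal sketch (the dict keys and layout here are illustrative, not xorl's actual config schema):

```python
# Hypothetical parallelism layout -- names are illustrative, not xorl's real API.
# The world size must equal the product of all parallel dimensions.
layout = {
    "data": 4,        # FSDP2 sharding groups
    "tensor": 2,      # tensor-parallel ranks per group
    "pipeline": 2,    # pipeline stages
    "expert": 1,      # expert-parallel groups (MoE)
    "sequence": 1,    # Ulysses / Ring Attention sequence-parallel degree
}

def required_world_size(layout):
    n = 1
    for dim in layout.values():
        n *= dim
    return n

print(required_world_size(layout))  # 16 GPUs for this mesh
```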
Local and server training
Run directly with torchrun for offline training, or use the REST API server for online RL loops with live inference engines.
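A launch might look roughly like the following; the module paths and flags after `torchrun`'s own options are placeholders, so consult the xorl documentation for the real entrypoints:

```shell
# Offline training: torchrun spawns one process per GPU.
# "xorl.train" and "--config" are hypothetical names, not confirmed xorl CLI.
torchrun --nproc-per-node=8 -m xorl.train --config train.yaml

# Server mode for online RL: a REST API that inference engines call into.
# "xorl.serve" is likewise a placeholder module name.
torchrun --nproc-per-node=8 -m xorl.serve --config train.yaml
```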
LoRA and QLoRA
Full LoRA support with QLoRA quantization in NVFP4, Block-FP8, and NF4 formats. Adaptive quantization noise and error correction built in.
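The core of LoRA, independent of xorl's implementation, is a frozen base weight plus a low-rank trainable correction. A minimal NumPy sketch under standard LoRA conventions (zero-initialized up-projection, `alpha/r` scaling):

```python
import numpy as np

# LoRA forward sketch (not xorl internals): y = x @ W + (alpha / r) * x @ A @ B
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 8, 4, 8

W = rng.normal(size=(d_in, d_out))      # frozen base weight (quantized in QLoRA)
A = rng.normal(size=(d_in, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d_out))                # trainable up-projection, zero-initialized

def lora_forward(x):
    return x @ W + (alpha / r) * (x @ A) @ B

x = rng.normal(size=(2, d_in))
# With B at zero, the adapter is a no-op: output matches the base model exactly.
assert np.allclose(lora_forward(x), x @ W)
```

Zero-initializing `B` means training starts from the base model's behavior; in QLoRA, `W` would additionally be stored in a quantized format such as NF4 and dequantized on the fly.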
Mixture of Experts
Fused expert kernels via Triton and Quack, Expert Parallelism with AllToAll and DeepEP (NVLink-optimized), plus routing cache and replay.
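Before any fused kernel runs, an MoE router picks the top-k experts per token and normalizes gate weights over that subset. A CPU-side sketch of that routing step (illustrative only; xorl's Triton/Quack kernels do this fused on-GPU):

```python
import numpy as np

# Top-k token routing sketch: each token is sent to its k highest-scoring experts.
rng = np.random.default_rng(1)
tokens, experts, k = 6, 4, 2

logits = rng.normal(size=(tokens, experts))   # router scores per token
topk = np.argsort(logits, axis=-1)[:, -k:]    # indices of the k best experts
gates = np.take_along_axis(logits, topk, axis=-1)
gates = np.exp(gates) / np.exp(gates).sum(-1, keepdims=True)  # softmax over chosen experts

assert topk.shape == (tokens, k)
assert np.allclose(gates.sum(-1), 1.0)        # per-token gate weights sum to 1
```

Caching these `(topk, gates)` decisions is what makes routing replay possible: a later pass can reuse the recorded assignment instead of re-running the router.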
Weight sync
NCCL broadcast from training ranks to SGLang inference endpoints after each step, enabling tight online RL integration.
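The flow is: after each optimizer step, the training rank holding fresh weights broadcasts them to every inference endpoint, which would otherwise keep serving stale parameters. A toy single-process simulation of that flow (plain Python stand-ins, not actual NCCL or SGLang calls):

```python
# Toy simulation of post-step weight sync; real xorl uses dist.broadcast over
# an NCCL group spanning the training ranks and the SGLang endpoints.
train_rank = {"w": [0.1, 0.2, 0.3]}   # source of truth after the optimizer step
inference_ranks = [{"w": [0.0, 0.0, 0.0]} for _ in range(2)]  # stale endpoints

def broadcast(src, dsts):
    # Stand-in for an NCCL broadcast from the source rank to all receivers.
    for d in dsts:
        d["w"] = list(src["w"])

broadcast(train_rank, inference_ranks)
assert all(r["w"] == train_rank["w"] for r in inference_ranks)
```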
Muon optimizer
Newton-Schulz orthogonalized gradient descent for 2D+ weight matrices, with Nesterov momentum and configurable LR scaling.
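Newton-Schulz iteration replaces a gradient matrix with an approximately orthogonal matrix of the same shape, equalizing its singular values before the update. The sketch below uses the classical cubic iteration for clarity; Muon in practice uses a tuned quintic variant, and this is not xorl's actual kernel:

```python
import numpy as np

# Cubic Newton-Schulz: drive every singular value of X toward 1,
# converging to the polar factor U @ V.T of the input matrix.
def newton_schulz_orthogonalize(G, steps=25):
    X = G / np.linalg.norm(G)            # Frobenius norm => singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # each step maps s -> 1.5*s - 0.5*s**3
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(4, 3))              # stands in for a 2D weight's gradient
O = newton_schulz_orthogonalize(G)
assert np.allclose(O.T @ O, np.eye(3), atol=1e-3)  # columns are orthonormal
```

Using only matrix multiplies (no SVD) is what makes this practical on GPU, and it is why Muon restricts the treatment to 2D+ weight matrices.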