Supported Models
xorl currently supports the following model architectures.
Architectures
Section titled “Architectures”| Architecture | HuggingFace class | Example models | Notes |
|---|---|---|---|
| Qwen3 (dense) | Qwen3ForCausalLM | Qwen/Qwen3-8B, Qwen/Qwen3-32B | Standard transformer with GQA, SwiGLU, RoPE. |
| Qwen3-MoE | Qwen3MoeForCausalLM | Qwen/Qwen3-30B-A3B, Qwen/Qwen3-235B-A22B | Mixture-of-Experts with top-k routing. Weight conversion is automatic — see below. |
| Qwen3.5 (dense) | Qwen3_5ForCausalLM | Qwen/Qwen3.5-7B | Qwen3.5 dense with hybrid full/linear attention layers. |
| Qwen3.5-MoE | Qwen3_5MoeForCausalLM | Qwen/Qwen3.5-35B-A3B, Qwen/Qwen3.5-397B-A17B | Qwen3.5 MoE with hybrid attention and grouped expert routing. |
Model selection is config-based: xorl reads model_type from config.json inside model_path and instantiates the appropriate class automatically.
Checkpoint format
Section titled “Checkpoint format”xorl expects checkpoints in HuggingFace format:
config.json— model architecture config*.safetensors— weight shards (single file or multi-shard)tokenizer.json/tokenizer_config.json— tokenizer files
Specify the checkpoint with model_path (local path or HF Hub ID). Use config_path and tokenizer_path separately if your config/tokenizer lives in a different location than the weights.
Key config fields for model loading
Section titled “Key config fields for model loading”| Field | Description |
|---|---|
model_path | Local path or HF Hub ID for weights. |
config_path | Path to config.json. Defaults to model_path. |
tokenizer_path | Path to tokenizer files. Defaults to config_path. |
attn_implementation | Attention backend: flash_attention_3, flash_attention_4, native, sdpa, eager. |
moe_implementation | MoE kernel: null (auto), triton, native, quack, eager. |
Tested configurations
Section titled “Tested configurations”The following model + training-mode combinations have pre-built example configs under examples/server/configs/:
| Model | Full weights | LoRA |
|---|---|---|
| Qwen3-8B | qwen3_8b_full.yaml | qwen3_8b_lora.yaml |
| Qwen3-30B-A3B (MoE) | qwen3_30b_a3b_full.yaml | — |
| Qwen3-Coder-30B-A3B (MoE) | qwen3_coder_30b_a3b_full.yaml | qwen3_coder_30b_a3b_lora.yaml |
| Qwen3-235B-A22B (MoE) | qwen3_235b_a22b_8node_ep64.yaml | — |
| Qwen3.5-35B-A3B (MoE) | qwen3_5_35b_a3b_full.yaml | qwen3_5_35b_a3b_lora.yaml |
| Qwen3.5-397B-A17B (MoE) | qwen3_5_397b_a17b_full.yaml | qwen3_5_397b_a17b_lora.yaml |
MoE models: automatic weight conversion
Section titled “MoE models: automatic weight conversion”MoE checkpoints from HuggingFace store experts as a ModuleList (one module per expert). xorl uses fused grouped-kernel (GKN) tensors for efficient expert dispatch. This conversion happens automatically during model loading — no separate preprocessing step is needed. Simply point model_path at the standard HuggingFace checkpoint and xorl will fuse the expert weights on the fly.
See the MoE section for details on MoE-specific config options including expert_parallel_size, ep_dispatch, and moe_implementation.