Server Architecture
The xorl training server is built as a layered system of independent processes that communicate over ZMQ. This page covers how those layers fit together, what each component does, and the full API surface.
System Overview
The server is composed of three processes launched by a single Launcher: the GPU workers (spawned via torchrun), the Orchestrator, and the API Server.
Component Deep Dive
Launcher
The Launcher is the single entrypoint. It orchestrates startup and shutdown of all sub-processes.
Two modes:
| Mode | What it does |
|---|---|
| auto | Spawns GPU workers via torchrun, then starts the Orchestrator and API Server as subprocesses |
| connect | Attaches to already-running workers (multi-node: the head node connects to workers started manually) |
Startup sequence:
1. _launch_workers_with_torchrun() → torchrun subprocess (all GPU ranks)
2. _get_rank0_worker_address() → discover rank 0 ZMQ address (file-based for multi-node)
3. multiprocessing.Process(run_orchestrator) → Orchestrator
4. _save_initial_checkpoint() → checkpoint "000000" before any training
5. multiprocessing.Process(run_api_server) → API Server
6. wait() → poll processes until one dies → stop()

Key config:
```python
Launcher(
    mode="auto",
    config_path="server_config.yaml",
    api_host="0.0.0.0",
    api_port=5555,
    max_running_requests=2,     # concurrent in-flight requests
    max_pending_requests=100,   # request queue depth
    operation_timeout=1800.0,   # seconds per operation
)
```

API Server
src/xorl/server/api_server/server.py
APIServer is a FastAPI application composed via mixins:
```
APIServer
├── TrainingOpsMixin          forward_backward, forward, optim_step
├── WeightsMixin              save/load checkpoints, list, delete, weights_info
├── InferenceEndpointsMixin   add/remove/list inference endpoints, sync weights
└── HealthMixin               /health, /healthz, sleep, wake
```

Two-phase async pattern:
Every training operation uses a non-blocking two-phase protocol to avoid HTTP timeout issues on long operations:
```
Phase 1: POST /api/v1/forward_backward
  → APIServer sends OrchestratorRequest via ZMQ ROUTER
  → Returns UntypedAPIFuture { request_id: "uuid-..." } immediately

Phase 2: POST /api/v1/retrieve_future { request_id: "uuid-..." }
  → Returns TryAgainResponse if still running
  → Returns result dict when complete
  → Returns error if failed
```

The xorl_client SDK handles polling automatically — callers just call .result() on the returned future object.
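The polling loop the SDK runs on the caller's behalf can be sketched as below. TryAgain, fetch, and retrieve_future are hypothetical stand-ins for the real xorl_client internals, not its actual API:

```python
import time

class TryAgain:
    """Stand-in for the server's TryAgainResponse (hypothetical name)."""

def retrieve_future(fetch, request_id, poll_interval=0.5, timeout=1800.0):
    """Poll until the operation behind request_id finishes.

    `fetch` is any callable that issues POST /api/v1/retrieve_future
    and returns either a TryAgain marker or the final result dict.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = fetch(request_id)
        if isinstance(resp, TryAgain):
            time.sleep(poll_interval)  # still running; back off and retry
            continue
        return resp                    # completed (or an error payload)
    raise TimeoutError(f"operation {request_id} did not finish in time")
```

The key point is that the HTTP request itself never blocks longer than one poll, so long training operations cannot hit connection timeouts.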
ZMQ topology:
```
API Server                      Orchestrator
ROUTER bind :6000   ──────►   DEALER connect :6000   (API → Engine)
PULL   bind :6001   ◄──────   PUSH   connect :6001   (Engine → API)
```

Messages are serialized with msgpack for efficiency.
Orchestrator
src/xorl/server/orchestrator/orchestrator.py
The Orchestrator runs in its own process as an event loop. It is the bridge between the HTTP API and the GPU workers.
Key internal classes:
| Class | File | Role |
|---|---|---|
| Orchestrator | orchestrator.py | Main event loop; coordinates all sub-components |
| Scheduler | scheduler.py | FIFO queue with per-request state (pending → processing → completed/failed) |
| RequestProcessor | request_processor.py | Packs incoming datums into micro-batches; dispatches to backend |
| RemoteBackend | backend/remote.py | ZMQ PAIR socket to rank 0 worker; sends commands, receives outputs |
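The Scheduler's FIFO state machine can be sketched as follows. FifoScheduler, its method names, and State are illustrative, not the real scheduler.py API:

```python
from collections import deque
from enum import Enum

class State(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

class FifoScheduler:
    """FIFO queue with per-request state tracking (illustrative sketch)."""

    def __init__(self, max_pending=100):
        self.max_pending = max_pending
        self.queue = deque()   # request_ids awaiting dispatch
        self.states = {}       # request_id -> State

    def add(self, request_id):
        if len(self.queue) >= self.max_pending:
            raise RuntimeError("pending queue full")
        self.queue.append(request_id)
        self.states[request_id] = State.PENDING

    def next(self):
        """Pop the oldest pending request and mark it processing."""
        request_id = self.queue.popleft()
        self.states[request_id] = State.PROCESSING
        return request_id

    def finish(self, request_id, ok=True):
        self.states[request_id] = State.COMPLETED if ok else State.FAILED
```

The max_pending bound corresponds to the Launcher's max_pending_requests setting: once the queue is full, new requests are rejected rather than buffered without limit.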
RunnerDispatcher + ModelRunner
src/xorl/server/runner/runner_dispatcher.py
The dispatcher runs on rank 0 inside the torchrun process group. It acts as the ZMQ boundary between the Orchestrator and the actual GPU computation.
Protocol:
```
Orchestrator (RemoteBackend)
  → ZMQ PAIR ──► RunnerDispatcher (rank 0)
        │ NCCL broadcast → ranks 1..N  (command type)
        │ NCCL scatter   → ranks 1..N  (batch data)
        ▼
  All ranks call ModelRunner.forward_backward()
        │ NCCL gather ← ranks 1..N  (outputs)
  ← ZMQ PAIR ◄── RunnerDispatcher (rank 0)  (result)
```

Supported commands:
| Command | Action |
|---|---|
| FORWARD_BACKWARD | Forward + backward pass, accumulate gradients |
| OPTIM_STEP | Apply gradients, step optimizer, advance LR scheduler |
| SAVE_STATE | Save DCP checkpoint |
| LOAD_STATE | Load DCP checkpoint |
| HEALTH_CHECK | Verify all ranks are alive |
| SHUTDOWN | Graceful exit |
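The dispatch on rank 0 amounts to a command-to-handler table. Command, dispatch, and the handler names below are illustrative stand-ins for the real protocol types:

```python
from enum import Enum, auto

class Command(Enum):
    FORWARD_BACKWARD = auto()
    OPTIM_STEP = auto()
    SAVE_STATE = auto()
    LOAD_STATE = auto()
    HEALTH_CHECK = auto()
    SHUTDOWN = auto()

def dispatch(cmd, runner):
    """Route a received command to the matching runner method.

    In the real dispatcher, rank 0 first NCCL-broadcasts `cmd` so that
    every rank enters the same collective; this sketch shows only the
    routing step that follows.
    """
    handlers = {
        Command.FORWARD_BACKWARD: runner.forward_backward,
        Command.OPTIM_STEP: runner.optim_step,
        Command.SAVE_STATE: runner.save_state,
        Command.LOAD_STATE: runner.load_state,
        Command.HEALTH_CHECK: lambda: "ok",
        Command.SHUTDOWN: lambda: "shutting down",
    }
    return handlers[cmd]()
```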
src/xorl/server/runner/model_runner.py — ModelRunner handles the actual model forward/backward/optimizer step on each rank. It receives the already-distributed micro-batches and calls into the model using the same FSDP2/EP/PP stack as local training.
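The FORWARD_BACKWARD / OPTIM_STEP split can be illustrated with a toy scalar model (no torch, no distributed machinery; purely to show the accumulate-then-apply contract):

```python
class ToyRunner:
    """Toy stand-in for ModelRunner: FORWARD_BACKWARD accumulates
    gradients across calls, OPTIM_STEP applies them and resets."""

    def __init__(self, weight=1.0, lr=0.1):
        self.weight = weight
        self.lr = lr
        self.grad = 0.0

    def forward_backward(self, x, target):
        pred = self.weight * x
        loss = (pred - target) ** 2
        self.grad += 2 * (pred - target) * x  # dL/dw, accumulated
        return loss

    def optim_step(self):
        self.weight -= self.lr * self.grad    # apply accumulated gradient
        self.grad = 0.0                       # ready for the next batch
```

This mirrors why the commands are separate in the protocol: the Orchestrator can issue several FORWARD_BACKWARD micro-batches before a single OPTIM_STEP.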
Protocol Layer
Messages between the API Server and Orchestrator use typed dataclasses serialized with msgpack:
```python
@dataclass
class OrchestratorRequest:
    request_id: str               # UUID for tracking
    request_type: RequestType     # ADD | ABORT | UTILITY
    operation: str                # "forward_backward" | "optim_step" | ...
    payload: OperationPayload     # ModelPassData | OptimStepData | ...
    seq_id: Optional[int]         # Ordering within a session
    timestamp: Optional[float]
```
```python
@dataclass
class OrchestratorOutputs:
    request_id: str
    output_type: OutputType       # forward | forward_backward | optim_step | error
    outputs: List[Dict[str, Any]]
```

Messages between the Orchestrator and workers use a separate protocol defined in protocol/orchestrator_runner.py.
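On the wire, a request is just the dataclass rendered as a plain dict. The round-trip can be sketched as below; json stands in for msgpack so the sketch has no third-party dependency, and the enum-typed fields are simplified to strings/dicts:

```python
import json
from dataclasses import dataclass, asdict
from typing import Any, Dict, Optional

@dataclass
class OrchestratorRequest:
    request_id: str
    request_type: str          # simplified: the real field is a RequestType enum
    operation: str
    payload: Dict[str, Any]    # simplified: the real field is an OperationPayload
    seq_id: Optional[int] = None
    timestamp: Optional[float] = None

req = OrchestratorRequest(
    request_id="uuid-1234",
    request_type="ADD",
    operation="forward_backward",
    payload={"input_ids": [[1, 2, 3]]},
    seq_id=0,
)
wire = json.dumps(asdict(req))               # msgpack.packb(...) in the real system
decoded = OrchestratorRequest(**json.loads(wire))
```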
Source
| File | Description |
|---|---|
| src/xorl/server/launcher.py | Launcher — process orchestration, startup/shutdown lifecycle |
| src/xorl/server/api_server/server.py | APIServer — FastAPI app, OrchestratorClient, FutureStore |
| src/xorl/server/api_server/endpoints.py | All FastAPI endpoint handlers |
| src/xorl/server/orchestrator/orchestrator.py | Orchestrator — event loop, request queue management |
| src/xorl/server/orchestrator/scheduler.py | Scheduler — FIFO request ordering |
| src/xorl/server/orchestrator/request_processor.py | RequestProcessor — sample packing, backend dispatch |
| src/xorl/server/runner/runner_dispatcher.py | RunnerDispatcher — rank 0 ZMQ↔NCCL bridge |
| src/xorl/server/runner/model_runner.py | ModelRunner — actual forward/backward/optim on GPU ranks |
| src/xorl/server/protocol/ | Typed msgpack protocol messages between all layers |