
Server Architecture

The xorl training server is built as a layered system of independent processes that communicate over ZMQ. This page covers how those layers fit together, what each component does, and the full API surface.


The server is composed of three processes launched by a single Launcher:

[Diagram: xorl Server process architecture. The Launcher spawns three processes: the API Server (FastAPI, HTTP :5555; ZMQ ROUTER :6000 send, PULL :6001 receive), the Orchestrator (request queue + scheduler; ZMQ DEALER :6000 receive, PUSH :6001 send), and the Workers (torchrun, ranks 0..N on GPU; ZMQ PAIR to rank 0 on :5556, NCCL broadcast/scatter between ranks). Abstraction layers, top to bottom: HTTP API (client-facing, xorl-client SDK) → APIServer (auth, routing, async futures) → Orchestrator (scheduling, packing, dispatch) → RunnerDispatcher (rank 0 proxy, broadcast/scatter) → ModelRunner (fwd/bwd/optim; FSDP · EP · PP).]

src/xorl/server/launcher.py

The Launcher is the single entrypoint. It orchestrates startup and shutdown of all sub-processes.

Two modes:

Mode     What it does
auto     Spawns GPU workers via torchrun, then starts the Orchestrator and API Server as subprocesses
connect  Attaches to already-running workers (multi-node: the head node connects to workers started manually)

Startup sequence:

1. _launch_workers_with_torchrun() → torchrun subprocess (all GPU ranks)
2. _get_rank0_worker_address() → discover rank 0 ZMQ address (file-based for multi-node)
3. multiprocessing.Process(run_orchestrator) → Orchestrator
4. _save_initial_checkpoint() → checkpoint "000000" before any training
5. multiprocessing.Process(run_api_server) → API Server
6. wait() → poll processes until one dies → stop()
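The final `wait()` step can be sketched as a simple poll-until-death loop. This is an illustrative stand-in, not the real implementation; the function name and the use of threads in place of the three subprocesses are assumptions for the sake of a self-contained example.

```python
import threading
import time

def wait_until_any_exits(children, poll_interval: float = 0.05):
    """Poll child processes until one of them dies, mirroring step 6 of the
    startup sequence (wait() -> stop()). Accepts anything exposing
    is_alive(), so threads can stand in for the real subprocesses here."""
    while True:
        for child in children:
            if not child.is_alive():
                return child
        time.sleep(poll_interval)

# Threads standing in for the API Server, Orchestrator, and torchrun workers.
children = [threading.Thread(target=time.sleep, args=(t,)) for t in (0.05, 0.5, 0.5)]
for c in children:
    c.start()

first_dead = wait_until_any_exits(children)  # returns once the short-lived child exits
```

Once any child exits, the real Launcher tears down the remaining processes via `stop()`.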

Key config:

Launcher(
    mode="auto",
    config_path="server_config.yaml",
    api_host="0.0.0.0",
    api_port=5555,
    max_running_requests=2,    # concurrent in-flight requests
    max_pending_requests=100,  # request queue depth
    operation_timeout=1800.0,  # seconds per operation
)

src/xorl/server/api_server/server.py

APIServer is a FastAPI application composed via mixins:

APIServer
├── TrainingOpsMixin forward_backward, forward, optim_step
├── WeightsMixin save/load checkpoints, list, delete, weights_info
├── InferenceEndpointsMixin add/remove/list inference endpoints, sync weights
└── HealthMixin /health, /healthz, sleep, wake
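A minimal sketch of that mixin composition. The `_submit` helper and the echoed return value are hypothetical; in the real server the training-ops methods enqueue an OrchestratorRequest over ZMQ and return a future.

```python
# Each mixin contributes one slice of the API surface; APIServer composes them.
class TrainingOpsMixin:
    def forward_backward(self, batch):
        # Delegates to a shared submit helper provided by the composed class.
        return self._submit("forward_backward", batch)

class HealthMixin:
    def health(self):
        return {"status": "ok"}

class APIServer(TrainingOpsMixin, HealthMixin):
    def _submit(self, operation, payload):
        # Hypothetical stand-in: the real server sends the request to the
        # Orchestrator and returns a future; here we just echo it.
        return {"operation": operation, "payload": payload}

server = APIServer()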

Two-phase async pattern:

Every training operation uses a non-blocking two-phase protocol to avoid HTTP timeout issues on long operations:

Phase 1: POST /api/v1/forward_backward
  → APIServer sends OrchestratorRequest via ZMQ ROUTER
  → Returns UntypedAPIFuture { request_id: "uuid-..." } immediately

Phase 2: POST /api/v1/retrieve_future { request_id: "uuid-..." }
  → Returns TryAgainResponse if still running
  → Returns the result dict when complete
  → Returns an error if the operation failed

The xorl_client SDK handles polling automatically — callers just call .result() on the returned future object.
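The polling loop inside `.result()` can be sketched as follows. The transport is injected as a callable and `TryAgainResponse` is modeled as a plain status dict; both are assumptions for illustration, not the SDK's actual types.

```python
import time
from typing import Any, Callable, Dict

TRY_AGAIN = {"status": "try_again"}  # stand-in for the real TryAgainResponse

def poll_future(post: Callable[[str, Dict[str, Any]], Dict[str, Any]],
                request_id: str,
                interval: float = 0.01,
                timeout: float = 5.0) -> Dict[str, Any]:
    """Sketch of what .result() does under the hood: POST
    /api/v1/retrieve_future until the server stops answering try-again."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = post("/api/v1/retrieve_future", {"request_id": request_id})
        if resp != TRY_AGAIN:
            return resp
        time.sleep(interval)
    raise TimeoutError(f"future {request_id} did not complete in {timeout}s")

# Fake transport: pretend the operation finishes on the third poll.
calls = {"n": 0}
def fake_post(path: str, body: Dict[str, Any]) -> Dict[str, Any]:
    calls["n"] += 1
    return TRY_AGAIN if calls["n"] < 3 else {"loss": 0.42}

result = poll_future(fake_post, "uuid-1234")
```

Decoupling the poller from the HTTP client keeps the retry policy testable without a running server.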

ZMQ topology:

API Server                     Orchestrator
ROUTER bind :6000  ──────►  DEALER connect :6000   (API → Engine)
PULL   bind :6001  ◄──────  PUSH   connect :6001   (Engine → API)

Messages are serialized with msgpack for efficiency.
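A minimal sketch of that wire framing, assuming the protocol dataclasses serialize to plain dicts (the field values here are illustrative, and `msgpack` must be installed):

```python
import msgpack  # third-party: pip install msgpack

# Hypothetical dict standing in for a real OrchestratorRequest.
request = {
    "request_id": "uuid-1234",
    "operation": "forward_backward",
    "payload": {"micro_batch_size": 8},
}

wire = msgpack.packb(request, use_bin_type=True)  # bytes sent on the ZMQ socket
decoded = msgpack.unpackb(wire, raw=False)        # dict recovered on receipt
```

msgpack produces a compact binary encoding, which matters when payloads carry large token batches.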


src/xorl/server/orchestrator/orchestrator.py

The Orchestrator runs in its own process as an event loop. It is the bridge between the HTTP API and the GPU workers.

Internal structure:

[Diagram: Orchestrator internal structure. input_thread receives ZMQ messages from the API Server (ROUTER/PULL) and calls Scheduler.add(). The Scheduler is a FIFO queue tracking pending → processing → completed/failed. The event_loop() thread dequeues from the Scheduler into the RequestProcessor (sample packing, batch dispatch), which calls the RemoteBackend (ZMQ PAIR to the workers). output_thread receives results from the worker and pushes them back to the API Server via ZMQ PUSH. Protocol messages: OrchestratorRequest / OrchestratorOutputs (msgpack).]

Key classes:

Class             File                  Role
Orchestrator      orchestrator.py       Main event loop; coordinates all sub-components
Scheduler         scheduler.py          FIFO queue with per-request state (pending → processing → completed/failed)
RequestProcessor  request_processor.py  Packs incoming datums into micro-batches; dispatches to the backend
RemoteBackend     backend/remote.py     ZMQ PAIR socket to the rank 0 worker; sends commands, receives outputs
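The Scheduler's request lifecycle can be sketched as a small FIFO with per-request state. Class and method names here are illustrative, not the real API in scheduler.py.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional

class State(Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class TrackedRequest:
    request_id: str
    state: State = State.PENDING

class FifoScheduler:
    """Sketch of the pending → processing → completed/failed lifecycle."""

    def __init__(self) -> None:
        self._queue = deque()                        # FIFO of pending requests
        self._by_id: Dict[str, TrackedRequest] = {}  # lookup for state updates

    def add(self, request_id: str) -> None:
        req = TrackedRequest(request_id)
        self._queue.append(req)
        self._by_id[request_id] = req

    def next(self) -> Optional[TrackedRequest]:
        if not self._queue:
            return None
        req = self._queue.popleft()
        req.state = State.PROCESSING
        return req

    def finish(self, request_id: str, ok: bool) -> None:
        self._by_id[request_id].state = State.COMPLETED if ok else State.FAILED
```

Keeping a side index by request_id is what lets retrieve_future report status for requests that have already left the queue.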

src/xorl/server/runner/runner_dispatcher.py

The dispatcher runs on rank 0 inside the torchrun process group. It acts as the ZMQ boundary between the Orchestrator and the actual GPU computation.

Protocol:

Orchestrator (RemoteBackend)
    → ZMQ PAIR ──► RunnerDispatcher (rank 0)
                     │ NCCL broadcast → ranks 1..N (command type)
                     │ NCCL scatter   → ranks 1..N (batch data)
                     ▼ All ranks call ModelRunner.forward_backward()
                     │ NCCL gather    ← ranks 1..N (outputs)
    ← ZMQ PAIR ◄── RunnerDispatcher (rank 0, result)
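The rank-0 control loop above can be sketched with the transport and collectives injected as callables. This is a structural sketch only: the function name is hypothetical, and `broadcast`/`scatter` stand in for the torch.distributed NCCL collectives.

```python
from typing import Any, Callable, Tuple

def dispatcher_loop(recv_cmd: Callable[[], Tuple[str, Any]],
                    broadcast: Callable[[str], None],
                    scatter: Callable[[Any], Any],
                    run: Callable[[str, Any], Any],
                    send_result: Callable[[Any], None]) -> None:
    """Rank-0 loop: take a command off the ZMQ PAIR socket, fan it out to
    the other ranks, execute locally, and reply to the Orchestrator."""
    while True:
        cmd, batch = recv_cmd()   # ZMQ PAIR recv from RemoteBackend
        broadcast(cmd)            # all ranks learn the command type
        shard = scatter(batch)    # each rank gets its micro-batch shard
        result = run(cmd, shard)  # every rank runs ModelRunner
        send_result(result)       # ZMQ PAIR send back to the Orchestrator
        if cmd == "SHUTDOWN":
            break

# In-process stubs standing in for ZMQ and NCCL:
inbox = iter([("FORWARD_BACKWARD", [1, 2, 3, 4]), ("SHUTDOWN", None)])
broadcasts: list = []
results: list = []

dispatcher_loop(
    recv_cmd=lambda: next(inbox),
    broadcast=broadcasts.append,
    scatter=lambda batch: batch[:2] if batch else None,  # rank 0's shard
    run=lambda cmd, shard: {"cmd": cmd, "n": len(shard or [])},
    send_result=results.append,
)
```

Broadcasting the command type before scattering data is what lets ranks 1..N, which never touch ZMQ, know which collective calls to participate in next.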

Supported commands:

Command           Action
FORWARD_BACKWARD  Forward + backward pass, accumulate gradients
OPTIM_STEP        Apply gradients, step the optimizer, advance the LR scheduler
SAVE_STATE        Save a DCP checkpoint
LOAD_STATE        Load a DCP checkpoint
HEALTH_CHECK      Verify all ranks are alive
SHUTDOWN          Graceful exit
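A sketch of how those commands might map onto runner methods. The enum values, the `execute` function, and `RecordingRunner` are illustrative test doubles, not the real dispatch code.

```python
from enum import Enum

class Command(str, Enum):
    # Mirrors the command table above; the string values are assumptions.
    FORWARD_BACKWARD = "forward_backward"
    OPTIM_STEP = "optim_step"
    SAVE_STATE = "save_state"
    LOAD_STATE = "load_state"
    HEALTH_CHECK = "health_check"
    SHUTDOWN = "shutdown"

class RecordingRunner:
    """Test double for ModelRunner: records which method each command hits."""
    def __init__(self):
        self.calls = []
    def forward_backward(self, batch):
        self.calls.append(("forward_backward", batch))
        return {"loss": 0.0}
    def optim_step(self):
        self.calls.append(("optim_step", None))
        return {"step": 1}

def execute(cmd: Command, runner, batch=None):
    """Map a wire command onto the runner method that handles it."""
    if cmd is Command.FORWARD_BACKWARD:
        return runner.forward_backward(batch)
    if cmd is Command.OPTIM_STEP:
        return runner.optim_step()
    if cmd is Command.HEALTH_CHECK:
        return {"alive": True}
    raise NotImplementedError(f"unhandled command: {cmd}")

runner = RecordingRunner()
out = execute(Command.FORWARD_BACKWARD, runner, batch=[1, 2])
```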

src/xorl/server/runner/model_runner.py

ModelRunner handles the actual model forward/backward/optimizer step on each rank. It receives the already-distributed micro-batches and calls into the model using the same FSDP2/EP/PP stack as local training.


Messages between the API Server and Orchestrator use typed dataclasses serialized with msgpack:

src/xorl/server/protocol/api_orchestrator.py

@dataclass
class OrchestratorRequest:
    request_id: str             # UUID for tracking
    request_type: RequestType   # ADD | ABORT | UTILITY
    operation: str              # "forward_backward" | "optim_step" | ...
    payload: OperationPayload   # ModelPassData | OptimStepData | ...
    seq_id: Optional[int]       # Ordering within a session
    timestamp: Optional[float]

@dataclass
class OrchestratorOutputs:
    request_id: str
    output_type: OutputType     # forward | forward_backward | optim_step | error
    outputs: List[Dict[str, Any]]

Messages between the Orchestrator and workers use a separate protocol defined in protocol/orchestrator_runner.py.


File                                               Description
src/xorl/server/launcher.py                        Launcher — process orchestration, startup/shutdown lifecycle
src/xorl/server/api_server/server.py               APIServer — FastAPI app, OrchestratorClient, FutureStore
src/xorl/server/api_server/endpoints.py            All FastAPI endpoint handlers
src/xorl/server/orchestrator/orchestrator.py       Orchestrator — event loop, request queue management
src/xorl/server/orchestrator/scheduler.py          Scheduler — FIFO request ordering
src/xorl/server/orchestrator/request_processor.py  RequestProcessor — sample packing, backend dispatch
src/xorl/server/runner/runner_dispatcher.py        RunnerDispatcher — rank 0 ZMQ↔NCCL bridge
src/xorl/server/runner/model_runner.py             ModelRunner — actual forward/backward/optim on GPU ranks
src/xorl/server/protocol/                          Typed msgpack protocol messages between all layers