Dataset Loading
xorl uses HuggingFace `datasets` under the hood. Fields in each entry of the `datasets:` list map directly to `load_dataset()` arguments.
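As a rough illustration of that mapping, the sketch below turns one config entry into keyword arguments for `load_dataset()`. The helper name `to_load_dataset_kwargs` and the set of keys treated as xorl-only (`type`, `max_seq_len`) are assumptions made for this example, not xorl's actual implementation.

```python
# Hypothetical sketch: how a `datasets:` entry could become load_dataset() kwargs.
# XORL_ONLY_KEYS is an assumption; xorl's real handling may differ.
XORL_ONLY_KEYS = {"type", "max_seq_len"}


def to_load_dataset_kwargs(entry: dict) -> dict:
    """Drop keys assumed to be consumed by xorl and forward the rest unchanged."""
    return {k: v for k, v in entry.items() if k not in XORL_ONLY_KEYS}


entry = {
    "path": "org/dataset",
    "split": "train",
    "revision": "abc123",
    "type": "tokenized",
    "max_seq_len": 4096,
}
kwargs = to_load_dataset_kwargs(entry)
print(kwargs)  # {'path': 'org/dataset', 'split': 'train', 'revision': 'abc123'}

# With the `datasets` package installed, the result would be used as:
#   from datasets import load_dataset
#   ds = load_dataset(**kwargs)
```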
Config Structure
    data:
      datasets:
        - path: <source>          # local path, HF Hub ID, or cloud URI
          type: tokenized         # only 'tokenized' is currently supported
          max_seq_len: 4096       # truncate individual samples to this length
          split: train
          # ... other HF load_dataset kwargs
      select_columns: [input_ids, labels]
      sample_packing_method: sequential   # sequential | multipack
      sample_packing_sequence_len: 8192   # packed bin target length

Data Types
| type | Input format | Notes |
|---|---|---|
| `tokenized` | `{input_ids: [...], labels: [...]}` | Pre-tokenized; the only currently supported type |
Expected format for tokenized datasets
Each row must contain `input_ids` (token IDs) and `labels` (target token IDs, with `-100` for positions to ignore):

    {"input_ids": [151644, 872, 198, 3838, 374, ...], "labels": [-100, -100, -100, -100, 374, ...]}
    {"input_ids": [151644, 872, 198, 2170, 1537, ...], "labels": [-100, -100, -100, 2170, 1537, ...]}

Positions labeled `-100` are ignored in the loss (`-100` is the default `ignore_index` of PyTorch's cross-entropy loss). Typically, prompt tokens are masked with `-100` and completion tokens keep their actual IDs as labels.
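The masking convention above can be sketched in a few lines. This is a minimal example, assuming labels are aligned position by position with `input_ids` as in the rows shown; the helper `build_row` and the token IDs are made up for illustration.

```python
import json

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy loss


def build_row(prompt_ids, completion_ids):
    """Concatenate prompt and completion; mask the prompt positions in labels."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}


# Token IDs below are placeholders, not real tokenizer output.
row = build_row(prompt_ids=[101, 102, 103], completion_ids=[7, 8, 9])
print(json.dumps(row))
# {"input_ids": [101, 102, 103, 7, 8, 9], "labels": [-100, -100, -100, 7, 8, 9]}
```

Writing one such JSON object per line produces a `.jsonl` file in the expected format.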
Source Types
Local File

    - path: /data/train.jsonl
      type: tokenized

Supported formats: `.jsonl`, `.json`, `.parquet`, `.csv`, `.arrow`. The format is auto-detected from the file extension.
Local Directory
    - path: /data/train_dir/
      type: tokenized

xorl tries `load_from_disk` first (for directories saved in HF `DatasetDict` format), then falls back to file discovery.
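The two-step resolution can be pictured roughly as follows. This is a sketch under stated assumptions: it treats a directory as a saved dataset when it contains one of the marker files that HF `save_to_disk` writes (`dataset_dict.json` for a `DatasetDict`, `state.json` for a single `Dataset`), and otherwise lists data files by extension. The helper `resolve_directory` is hypothetical and is not xorl's code.

```python
import tempfile
from pathlib import Path

DATA_EXTENSIONS = {".jsonl", ".json", ".parquet", ".csv", ".arrow"}
# Marker files assumed to indicate a directory written by HF save_to_disk.
SAVED_MARKERS = ("dataset_dict.json", "state.json")


def resolve_directory(path: str):
    """Return ('load_from_disk', []) or ('file_discovery', sorted data file names)."""
    root = Path(path)
    if any((root / marker).is_file() for marker in SAVED_MARKERS):
        return ("load_from_disk", [])
    files = sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in DATA_EXTENSIONS
    )
    return ("file_discovery", files)


with tempfile.TemporaryDirectory() as tmp:
    for name in ("shard_01.parquet", "shard_00.parquet", "README.md"):
        (Path(tmp) / name).touch()
    print(resolve_directory(tmp))
    # ('file_discovery', ['shard_00.parquet', 'shard_01.parquet'])
```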
To load specific files from a directory:
    - path: /data/train_dir/
      data_files: [shard_00.parquet, shard_01.parquet]

HuggingFace Hub
    - path: HuggingFaceH4/ultrachat_200k
      split: train_sft
      type: tokenized

With a specific revision:

    - path: org/dataset
      revision: abc123
      trust_remote_code: true

Amazon S3
    - path: s3://my-bucket/data/train/
      type: tokenized

Requires environment variables:

    export AWS_ACCESS_KEY_ID=...
    export AWS_SECRET_ACCESS_KEY=...
    export AWS_SESSION_TOKEN=...  # if using temporary credentials

Google Cloud Storage
    - path: gs://my-bucket/data/train/

Requires GCS credentials configured in the environment.
Azure Data Lake Gen2
    - path: abfs://my-container/data/train/

Requires:

    export AZURE_STORAGE_ACCOUNT_NAME=...
    export AZURE_STORAGE_ACCOUNT_KEY=...

HTTPS (public)
    - path: https://example.com/data/train.jsonl

Multiple Datasets
Multiple datasets are concatenated and shuffled:

    data:
      datasets:
        - path: /data/sft_data.jsonl
          type: tokenized
          max_seq_len: 4096
        - path: org/public_dataset
          split: train
          type: tokenized
          max_seq_len: 2048

Sample Packing
xorl packs multiple short samples into a single training bin of length `sample_packing_sequence_len`, which reduces padding and improves GPU utilization.

    data:
      sample_packing_method: sequential   # preserve dataset order
      sample_packing_sequence_len: 8192

- `sequential`: Packs samples in dataset order. Fast and deterministic, but some bins may be only partially filled.
- `multipack`: Solves a bin-packing problem to maximize fill. Bins take longer to compute.

Packing bins are cached after the first computation, under `{dataset_prepared_path}/{dataset_hash}/` (`dataset_prepared_path` defaults to `last_prepared_dataset`).
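The trade-off between the two methods can be seen in a small sketch. `pack_sequential` follows the in-order behavior described above. `pack_first_fit_decreasing` is a common bin-packing heuristic used here only as a stand-in for `multipack`; xorl's actual algorithm is not shown in this document and may differ. Both functions are hypothetical and operate on sample lengths only.

```python
def pack_sequential(lengths, bin_len):
    """Fill bins in dataset order; open a new bin when the next sample does not fit."""
    bins, current, used = [], [], 0
    for i, n in enumerate(lengths):
        if n > bin_len:
            raise ValueError(f"sample {i} (length {n}) exceeds bin length {bin_len}")
        if used + n > bin_len:
            bins.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        bins.append(current)
    return bins


def pack_first_fit_decreasing(lengths, bin_len):
    """Place samples longest-first into the first bin with enough free space."""
    order = sorted(range(len(lengths)), key=lambda i: -lengths[i])
    bins, free = [], []
    for i in order:
        n = lengths[i]
        if n > bin_len:
            raise ValueError(f"sample {i} (length {n}) exceeds bin length {bin_len}")
        for b, space in enumerate(free):
            if n <= space:
                bins[b].append(i)
                free[b] -= n
                break
        else:  # no existing bin had room
            bins.append([i])
            free.append(bin_len - n)
    return bins


lengths = [5000, 4000, 3000, 4000]
print(pack_sequential(lengths, 8192))            # [[0], [1, 2], [3]]
print(pack_first_fit_decreasing(lengths, 8192))  # [[0, 2], [1, 3]]
```

In this example the in-order pass needs three bins, while the reordering heuristic fits the same samples into two.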
Ring Attention Constraint
When `ringattn_parallel_size > 1`, each document in a packed bin must have a length divisible by `2 × ringattn_parallel_size × ulysses_parallel_size`. xorl enforces this automatically during data preparation when `ringattn_parallel_size` is set.
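A short worked example of the divisibility rule follows. The text does not say whether xorl pads or truncates documents to satisfy the constraint; this sketch assumes padding up to the next multiple, and the default of 1 for `ulysses_parallel_size` is also an assumption. Both helper functions are hypothetical.

```python
def required_multiple(ringattn_parallel_size, ulysses_parallel_size=1):
    """Document lengths must be divisible by 2 * ring size * ulysses size."""
    return 2 * ringattn_parallel_size * ulysses_parallel_size


def padded_length(doc_len, multiple):
    """Round doc_len up to the nearest multiple (assumes padding, not truncation)."""
    return -(-doc_len // multiple) * multiple


multiple = required_multiple(ringattn_parallel_size=4, ulysses_parallel_size=2)
print(multiple)                       # 16
print(padded_length(1000, multiple))  # 1008
print(padded_length(1024, multiple))  # 1024
```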
Column Selection
Specify which columns to keep after loading:

    data:
      select_columns: [input_ids, labels]

For RL training with importance sampling, include additional fields:

      select_columns: [input_ids, labels, advantages, logprobs]

Preprocessing
Run preprocessing as a separate step before training (useful for large datasets):

    python -m xorl.cli.preprocess config.yaml

Preprocessing tokenizes and packs the dataset and saves the result to disk. Training then loads from the preprocessed cache, skipping repeated computation.