Dataset Loading

xorl uses HuggingFace datasets under the hood. The fields of each entry in the datasets: list map directly to load_dataset() arguments.

data:
  datasets:
    - path: <source>                  # local path, HF Hub ID, or cloud URI
      type: tokenized                 # only 'tokenized' is currently supported
      max_seq_len: 4096               # truncate individual samples to this length
      split: train
      # ... other HF load_dataset kwargs
  select_columns: [input_ids, labels]
  sample_packing_method: sequential   # sequential | multipack
  sample_packing_sequence_len: 8192   # packed bin target length

| type | Input format | Notes |
|------|--------------|-------|
| tokenized | {input_ids: [...], labels: [...]} | Pre-tokenized; the only currently supported type |

Each row must contain input_ids (token IDs) and labels (target token IDs, with -100 for positions to ignore):

{"input_ids": [151644, 872, 198, 3838, 374, ...], "labels": [-100, -100, -100, -100, 374, ...]}
{"input_ids": [151644, 872, 198, 2170, 1537, ...], "labels": [-100, -100, -100, 2170, 1537, ...]}

-100 positions are ignored in the loss (standard PyTorch ignore_index). Typically, prompt tokens are masked (-100) and completion tokens have their actual IDs as labels.
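
As a minimal sketch of the masking convention described above, the snippet below builds one row from a prompt and a completion and serializes it as a JSONL line. The names build_row and IGNORE_INDEX are hypothetical helpers for illustration, not part of xorl; the token IDs are borrowed from the example rows.

```python
import json

IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy loss


def build_row(prompt_ids, completion_ids):
    """Build one tokenized row: prompt positions masked, completion positions kept."""
    input_ids = list(prompt_ids) + list(completion_ids)
    # Prompt tokens contribute no loss; completion tokens are their own targets.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}


row = build_row(prompt_ids=[151644, 872, 198], completion_ids=[3838, 374])
jsonl_line = json.dumps(row)  # one line of a .jsonl training file
```

Both lists always have the same length, which is what a per-token loss expects.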

From a local file:

- path: /data/train.jsonl
  type: tokenized

Supported formats: .jsonl, .json, .parquet, .csv, .arrow. The format is auto-detected from the file extension.
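
Extension-based detection might look roughly like the sketch below. The mapping and the detect_builder function are assumptions for illustration; xorl's actual lookup is not shown in this page. The values are the names of HuggingFace's packaged builders (json, parquet, csv, arrow).

```python
from pathlib import Path

# Hypothetical mapping from file extension to a HuggingFace builder name.
EXTENSION_TO_BUILDER = {
    ".jsonl": "json",
    ".json": "json",
    ".parquet": "parquet",
    ".csv": "csv",
    ".arrow": "arrow",
}


def detect_builder(path: str) -> str:
    """Return the builder name for a data file, based only on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TO_BUILDER:
        raise ValueError(f"Unsupported file extension {suffix!r} for {path}")
    return EXTENSION_TO_BUILDER[suffix]


builder = detect_builder("/data/train.jsonl")
```

With HuggingFace datasets installed, the result could then be used as load_dataset(builder, data_files="/data/train.jsonl").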

From a local directory:

- path: /data/train_dir/
  type: tokenized

xorl first tries load_from_disk (HF DatasetDict format); if that fails, it falls back to discovering data files in the directory.

To load specific files from a directory:

- path: /data/train_dir/
  data_files: [shard_00.parquet, shard_01.parquet]

From the HuggingFace Hub:

- path: HuggingFaceH4/ultrachat_200k
  split: train_sft
  type: tokenized

With a specific revision:

- path: org/dataset
  revision: abc123
  trust_remote_code: true

From Amazon S3:

- path: s3://my-bucket/data/train/
  type: tokenized

Requires environment variables:

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=... # if using temporary credentials

From Google Cloud Storage:

- path: gs://my-bucket/data/train/

Requires GCS credentials configured in the environment.

From Azure storage:

- path: abfs://my-container/data/train/

Requires:

export AZURE_STORAGE_ACCOUNT_NAME=...
export AZURE_STORAGE_ACCOUNT_KEY=...

Over HTTP(S):

- path: https://example.com/data/train.jsonl

Multiple datasets are concatenated and shuffled:

data:
  datasets:
    - path: /data/sft_data.jsonl
      type: tokenized
      max_seq_len: 4096
    - path: org/public_dataset
      split: train
      type: tokenized
      max_seq_len: 2048
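
A minimal sketch of what "concatenated and shuffled" could look like, using plain Python lists in place of HF datasets. It assumes each dataset's max_seq_len is applied before concatenation and that the shuffle is seeded; truncate, mix_datasets, and the seed parameter are hypothetical names, not xorl APIs.

```python
import random


def truncate(sample, max_seq_len):
    """Cut every column of one sample to at most max_seq_len tokens."""
    return {key: values[:max_seq_len] for key, values in sample.items()}


def mix_datasets(sources, seed=0):
    """Concatenate (samples, max_seq_len) pairs, then shuffle deterministically."""
    mixed = []
    for samples, max_seq_len in sources:
        mixed.extend(truncate(sample, max_seq_len) for sample in samples)
    random.Random(seed).shuffle(mixed)  # seeded so repeated runs see the same order
    return mixed


sft = [{"input_ids": [1, 2, 3, 4, 5], "labels": [-100, -100, 3, 4, 5]}]
public = [
    {"input_ids": [6, 7, 8], "labels": [-100, 7, 8]},
    {"input_ids": [9, 10, 11], "labels": [-100, 10, 11]},
]
mixed = mix_datasets([(sft, 4), (public, 2)], seed=0)
```

With HuggingFace datasets, the equivalent building blocks would be concatenate_datasets() followed by Dataset.shuffle(seed=...).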

xorl packs multiple short samples into a single training bin of length sample_packing_sequence_len, maximizing GPU utilization.

data:
  sample_packing_method: sequential   # preserve dataset order
  sample_packing_sequence_len: 8192
  • sequential: Packs samples in dataset order. Fast and deterministic, but some bins may be only partially filled.
  • multipack: Treats packing as a bin-packing problem and reorders samples to maximize fill. Bins are fuller, but slower to compute.
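
The trade-off between the two methods can be illustrated with a small sketch. The sequential variant below is a next-fit pass in dataset order. For the reordering variant, first-fit-decreasing is used as a stand-in heuristic; this page does not say which algorithm xorl's multipack uses, so treat it as an illustration only. Both function names are hypothetical, and raising on over-long samples is an assumption.

```python
def pack_sequential(lengths, bin_len):
    """Next-fit: fill the current bin in dataset order, open a new one when full."""
    bins, current, used = [], [], 0
    for idx, n in enumerate(lengths):
        if n > bin_len:
            raise ValueError(f"sample {idx} ({n} tokens) exceeds bin length {bin_len}")
        if used + n > bin_len:
            bins.append(current)
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        bins.append(current)
    return bins


def pack_first_fit_decreasing(lengths, bin_len):
    """FFD heuristic: longest samples first, each into the first bin with room."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, free = [], []
    for idx in order:
        n = lengths[idx]
        if n > bin_len:
            raise ValueError(f"sample {idx} ({n} tokens) exceeds bin length {bin_len}")
        for b, space in enumerate(free):
            if n <= space:
                bins[b].append(idx)
                free[b] -= n
                break
        else:  # no existing bin had room
            bins.append([idx])
            free.append(bin_len - n)
    return bins


lengths = [5000, 4000, 3000, 4000]
sequential_bins = pack_sequential(lengths, bin_len=8192)          # [[0], [1, 2], [3]]
reordered_bins = pack_first_fit_decreasing(lengths, bin_len=8192)  # [[0, 2], [1, 3]]
```

On this input the in-order pass needs three bins while the reordering pass needs two, at the cost of sorting and searching existing bins.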

Packing bins are computed once and then cached under {dataset_prepared_path}/{dataset_hash}/ (dataset_prepared_path defaults to last_prepared_dataset).

When using ringattn_parallel_size > 1, each document in a packed bin must have length divisible by 2 × ringattn_parallel_size × ulysses_parallel_size. xorl enforces this automatically during data preparation when ringattn_parallel_size is set.
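
The divisibility requirement is simple arithmetic, sketched below. The helper names are hypothetical, and the default of 1 for ulysses_parallel_size is an assumption. This page does not state whether xorl pads or truncates documents to satisfy the constraint; rounding up is shown only as one possibility.

```python
def required_multiple(ringattn_parallel_size, ulysses_parallel_size=1):
    """Document lengths must be divisible by 2 x ring size x ulysses size."""
    return 2 * ringattn_parallel_size * ulysses_parallel_size


def round_up(length, multiple):
    """Smallest value >= length that is divisible by multiple."""
    return -(-length // multiple) * multiple  # ceiling division, then scale back


multiple = required_multiple(ringattn_parallel_size=4, ulysses_parallel_size=2)  # 16
padded_len = round_up(1000, multiple)  # 1008
```

A 1000-token document with ringattn_parallel_size 4 and ulysses_parallel_size 2 would therefore need 8 extra positions to reach the next multiple of 16.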

Specify which columns to keep after loading:

data:
  select_columns: [input_ids, labels]

For RL training with importance sampling, include additional fields:

select_columns: [input_ids, labels, advantages, logprobs]

Run preprocessing as a separate step before training (useful for large datasets):

python -m xorl.cli.preprocess config.yaml

Preprocessing tokenizes and packs the dataset, saving the result to disk. Training then loads from the preprocessed cache, skipping repeated computation.