Datasets
Hugging Face library for large shared datasets: memory mapping, streaming, Arrow-backed columns, and Hub integration.
Why it is included
Foundational open-source tooling for reproducible NLP/LLM training-data loading at scale.
Best for
Anyone fine-tuning or evaluating on multi-terabyte corpora without custom loaders.
Strengths
- Streaming of datasets larger than memory, straight from the Hub
- Automatic caching of downloads and processed (mapped) results
- Hub interoperability: push, pull, and version datasets alongside models
Limitations
- Very custom data may still need bespoke preprocessing
Good alternatives
WebDataset · Petastorm · tf.data
Related tools
AI & Machine Learning
Hugging Face Transformers
State-of-the-art pretrained models for PyTorch, TensorFlow, and JAX.
AI & Machine Learning
DVC
Data version control for ML: version datasets and models with Git, cloud storage, and reproducible pipelines.
AI & Machine Learning
Axolotl
YAML-configured fine-tuning for LLMs: LoRA, QLoRA, FSDP, and many architectures on top of Hugging Face trainers.
AI & Machine Learning
SmolLM
Hugging Face (HuggingFaceTB) small LM family (135M–1.7B) with Apache-2.0 weights, aimed at on-device and edge quality per size.
AI & Machine Learning
OpenAI gpt-oss (Hub)
OpenAI’s open-weight GPT-OSS checkpoints (e.g. 20B, 120B) hosted on Hugging Face for local inference and fine-tuning.
AI & Machine Learning
GPT-2 (Hugging Face)
Historic decoder-only LM family (124M–1.5B) under `openai-community` on the Hub—still a default tutorial and pipeline test target.
