Datasets
Hugging Face library for large shared datasets: memory mapping, streaming, Arrow-backed columns, and Hub integration.
Why it is included
Foundational open-source tooling for reproducible NLP/LLM training-data loading at scale.
Best for
Anyone fine-tuning or evaluating on multi-terabyte corpora without custom loaders.
Strengths
- Streaming of datasets larger than memory, straight from the Hub
- Automatic caching of downloads and processed (mapped) results
- Hub interoperability: push, pull, and version datasets alongside models
Limitations
- Very custom data may still need bespoke preprocessing
Good alternatives
WebDataset · Petastorm · tf.data
Related tools
AI & Machine Learning
Hugging Face Transformers
State-of-the-art pretrained models for PyTorch, TensorFlow, and JAX.
AI & Machine Learning
DVC
Data version control for ML: version datasets and models with Git, cloud storage, and reproducible pipelines.
AI & Machine Learning
Axolotl
YAML-configured fine-tuning for LLMs: LoRA, QLoRA, FSDP, and many architectures on top of Hugging Face trainers.
AI & Machine Learning
SmolLM
Hugging Face (HuggingFaceTB) small LM family (135M–1.7B) with Apache-2.0 weights, aimed at on-device and edge quality per size.
AI & Machine Learning
OpenAI gpt-oss (Hub)
OpenAI’s open-weight GPT-OSS checkpoints (e.g. 20B, 120B) hosted on Hugging Face for local inference and fine-tuning.
AI & Machine Learning
GPT-2 (Hugging Face)
Historic decoder-only LM family (124M–1.5B) under `openai-community` on the Hub—still a default tutorial and pipeline test target.
