Gen AI · Course · 01 of 4
Data Management
Data is the fuel. How you manage it determines how fast you go.
Learning objectives
- Load, stream, and cache datasets using the Hugging Face
datasetslibrary - Convert between CSV, JSON, Parquet, and Arrow formats and explain their tradeoffs
- Create reproducible train/validation/test splits with fixed random seeds
- Manage large model and dataset files using
.gitignore, Git LFS, or DVC
The problem
Every AI project starts with data. You need to find datasets, download them, convert between formats, split them for training and evaluation, and version them so experiments are reproducible. Doing this manually every time is slow and error-prone. You need a repeatable workflow.
Pre-lesson check
0/2 answered// question
Why is it important to have separate train, validation, and test splits in machine learning?
// question
What is the Hugging Face Hub primarily used for in AI/ML workflows?
The concept
The Hugging Face datasets library is the standard way to load data for AI work. It handles downloading, caching, format conversion, and streaming out of the box.
›The data pipeline
Build it
Step 1 — Install the datasets library
pip install datasets huggingface_hubStep 2 — Load a dataset
from datasets import load_dataset
dataset = load_dataset("imdb")
print(dataset)
print(dataset["train"][0])This downloads the IMDB movie review dataset. After the first download, it loads from cache at ~/.cache/huggingface/datasets/.
Step 3 — Stream large datasets
Some datasets are too large to fit on disk. Streaming loads them row by row without downloading the full thing.
dataset = load_dataset(
"wikimedia/wikipedia", "20220301.en", split="train", streaming=True
)
for i, example in enumerate(dataset):
print(example["title"])
if i >= 4:
breakStreaming gives you an IterableDataset. You process rows as they arrive. Memory usage stays constant regardless of dataset size.
Step 4 — Dataset formats
The datasets library uses Apache Arrow under the hood. You can convert to other formats depending on what your pipeline needs.
dataset = load_dataset("imdb", split="train")
dataset.to_csv("imdb_train.csv")
dataset.to_json("imdb_train.json")
dataset.to_parquet("imdb_train.parquet")| Format | Size | Read speed | Best for |
|---|---|---|---|
| CSV | Large | Slow | Human readability, spreadsheets |
| JSON | Large | Slow | APIs, nested data |
| Parquet | Small | Fast | Analytics, columnar queries |
| Arrow | Small | Fastest | In-memory processing (what datasets uses internally) |
Step 5 — Data splits
Every ML project needs three splits:
- Train — the model learns from this (typically 80%)
- Validation — you check progress during training (typically 10%)
- Test — final evaluation after training is done (typically 10%)
Some datasets come pre-split. When they don't, split them yourself:
dataset = load_dataset("imdb", split="train")
split = dataset.train_test_split(test_size=0.2, seed=42)
train_val = split["train"].train_test_split(test_size=0.125, seed=42)
train_ds = train_val["train"]
val_ds = train_val["test"]
test_ds = split["test"]
print(f"Train: {len(train_ds)}, Val: {len(val_ds)}, Test: {len(test_ds)}")Step 6 — Download and cache models
Models are large files. The huggingface_hub library handles downloading and caching.
from huggingface_hub import hf_hub_download, snapshot_download
model_path = hf_hub_download(
repo_id="sentence-transformers/all-MiniLM-L6-v2",
filename="config.json",
)
print(f"Cached at: {model_path}")
model_dir = snapshot_download("sentence-transformers/all-MiniLM-L6-v2")
print(f"Full model at: {model_dir}")Models cache to ~/.cache/huggingface/hub/. Once downloaded, they load instantly on subsequent runs.
Step 7 — Handle large files
Model weights and large datasets should not go into git. Three options:
Option A — .gitignore (simplest)
*.bin
*.safetensors
*.pt
*.onnx
data/*.parquet
data/*.csv
models/Option B — Git LFS (track large files in git)
git lfs install
git lfs track "*.bin"
git lfs track "*.safetensors"
git add .gitattributesGit LFS stores pointers in your repo and the actual files on a separate server. GitHub gives you 1 GB free.
Option C — DVC (data version control)
pip install dvc
dvc init
dvc add data/training_set.parquet
git add data/training_set.parquet.dvc data/.gitignore
git commit -m "Track training data with DVC"DVC creates small .dvc files that point to your data. The data itself lives in S3, GCS, or another remote storage backend.
| Approach | Complexity | Best for |
|---|---|---|
.gitignore | Low | Personal projects, downloaded data you can re-fetch |
| Git LFS | Medium | Teams sharing model weights via git |
| DVC | High | Reproducible experiments, large datasets, teams |
For this course, .gitignore is enough. Use DVC when you need to reproduce exact experiments across machines.
Step 8 — Storage patterns
Local storage works for datasets under ~10 GB. The HF cache handles this automatically. Cloud storage is for anything larger or shared across machines:
import os
local_path = os.path.expanduser("~/.cache/huggingface/datasets/")
# s3_path = "s3://my-bucket/datasets/"
# gcs_path = "gs://my-bucket/datasets/"DVC integrates with S3 and GCS directly:
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc pushFor this course, local storage is sufficient. Cloud storage becomes relevant when you fine-tune on remote GPU instances.
Datasets used in this course
| Dataset | Lessons | Size | What it teaches |
|---|---|---|---|
| IMDB | Tokenization, classification | 84 MB | Text classification basics |
| WikiText | Language modeling | 181 MB | Next-token prediction |
| SQuAD | QA systems | 35 MB | Question answering, spans |
| Common Crawl (subset) | Embeddings | Varies | Large-scale text processing |
| MNIST | Vision basics | 21 MB | Image classification fundamentals |
| COCO (subset) | Multimodal | Varies | Image-text pairs |
You do not need to download all of these now. Each lesson specifies what it needs.
Use it
Run the utility script to verify everything works:
python foundations/data-management/data_utils.pyThis downloads a small dataset, converts it, splits it, and prints a summary. View it without leaving the page: .
Ship it
This lesson produces:
- — reusable data loading and caching utility
- — the full walkthrough as a runnable notebook
- — a prompt for finding the right dataset for a task
Exercises
- Load the
gluedataset with themrpcconfig and inspect the first 5 examples - Stream the
c4dataset and count how many examples you can process in 10 seconds - Convert a dataset to Parquet and compare the file size to CSV
- Create a 70/15/15 train/val/test split with a fixed seed and verify the sizes
Post-lesson quiz
0/3 answered// question
What advantage does the Parquet format have over CSV for storing ML datasets?
// question
What does 'streaming=True' do when loading a dataset with the datasets library?
// question
When should you use DVC instead of just .gitignore for large files?
Key terms
| Term | What people say | What it actually means |
|---|---|---|
| Dataset split | "Training data" | A named subset (train/val/test) used at different stages of the ML lifecycle |
| Streaming | "Load it lazily" | Processing data row by row from a remote source without downloading the full dataset |
| Parquet | "Compressed CSV" | A columnar file format optimized for analytical queries and storage efficiency |
| Arrow | "Fast dataframe" | An in-memory columnar format used internally by datasets for zero-copy reads |
| Git LFS | "Git for big files" | An extension that stores large files outside the git repo while keeping pointers in version control |
| DVC | "Git for data" | A version control system for datasets and models that integrates with cloud storage |
| Cache | "Already downloaded" | A local copy of previously fetched data, stored at ~/.cache/huggingface/ by default |