Module 5 · Data — The Fuel of AI

Labeling, Dataset Splitting & the Danger of Data Leakage

70 min

Learning objectives

Explain what labels are and why labeling quality matters for supervised learning
Describe the purpose of train, validation, and test splits and how they differ
Define data leakage, recognize common causes, and explain why it ruins models

Labels: the answer key

In supervised learning, the most common kind of machine learning, each training example carries a label: the correct answer the model should learn to produce. Examples include an email labeled 'spam,' a photo labeled 'cat,' and a loan labeled 'defaulted.' The model studies thousands of input-label pairs and learns the mapping from one to the other.

Label — The correct target answer attached to a training example; what a supervised model is trained to predict.

If labels are noisy or inconsistent, the model learns the wrong answer key. Labeling quality is a first-class data-quality concern, not an afterthought.

Example — Inconsistent labeling

Two annotators label support tickets. One marks billing questions as 'finance,' the other as 'account.' The model now receives contradictory signals for nearly identical tickets and struggles to learn a clean boundary. A clear labeling guideline would have prevented it.
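One way to catch this kind of inconsistency early is to have both annotators label the same sample of tickets and measure how often they agree. The sketch below computes Cohen's kappa, a standard agreement score that corrects for chance, and counts which label pairs clash. The ticket labels are made up for illustration, and the 'tech' category is an assumption that does not come from the example above.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical labels for the same six tickets.
annotator_1 = ["finance", "finance", "tech", "tech", "finance", "tech"]
annotator_2 = ["account", "account", "tech", "tech", "finance", "tech"]

kappa = cohen_kappa(annotator_1, annotator_2)

# Which label pairs do the annotators disagree on, and how often?
disagreements = Counter(
    (a, b) for a, b in zip(annotator_1, annotator_2) if a != b
)

print(f"kappa = {kappa:.2f}")
print(disagreements.most_common())
```

A kappa well below 1, together with one dominant disagreement pair such as ('finance', 'account'), suggests that the labeling guideline needs a rule for that case rather than that the annotators are careless.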

Splitting the data: train, validation, test

To tell whether a model has truly learned general patterns rather than merely memorized its examples, we split the data into three disjoint parts. The training set fits the model. The validation set is used to tune settings (often called hyperparameters) and to compare candidate models. The test set is locked away and used once, at the end, to estimate how the model will perform on genuinely unseen data.

Split	Purpose	Used how often
Training	Fit the model's parameters	Many times, every training run
Validation	Tune settings, compare models, detect overfitting	Repeatedly during development
Test	Final, honest estimate of real-world performance	Once, at the very end
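A three-way split can be sketched in a few lines of plain Python. The function name three_way_split and the 70/15/15 proportions are illustrative assumptions, not values prescribed by this lesson. The fixed seed is there only so the split is reproducible.

```python
import random

def three_way_split(records, val_fraction=0.15, test_fraction=0.15, seed=42):
    """Shuffle once, then cut into disjoint train / validation / test lists."""
    if not 0 < val_fraction + test_fraction < 1:
        raise ValueError("validation + test fractions must be between 0 and 1")
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

records = list(range(100))  # stand-in for 100 labeled examples
train, val, test = three_way_split(records)

print(len(train), len(val), len(test))
```

Because every record lands in exactly one slice of the shuffled list, the three parts cannot overlap, provided the input itself contains no duplicate records. A random shuffle like this suits independent records. Time-ordered data usually calls for a split by date instead.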

Analogy

It's like studying for an exam. The training set is your textbook, the validation set is the practice quiz you use to adjust how you study, and the test set is the real exam. If you peek at the real exam questions while studying, your score is meaningless — that's the heart of data leakage.

Watch out

Never tune your model against the test set. Each time you make a decision based on test results, the test set quietly turns into another validation set and stops giving an honest estimate.

Data leakage: the silent model-killer

Data leakage happens when information that would not be available at prediction time leaks into training. The model 'cheats' by using that information, scores beautifully in testing, then collapses in production where the leaked signal is gone. Leakage is dangerous precisely because the metrics look great — the failure is invisible until deployment.

Data leakage — When information unavailable at real prediction time enters training, inflating test scores and causing failure in production.

Example — Classic leakage

A model predicts which patients will be hospitalized. A feature called 'discharge_date' is included — but a discharge date only exists after hospitalization. The model achieves near-perfect accuracy in testing, then fails in the real world because that field is empty at the moment a prediction is actually needed.
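A simple screen for this pattern is to check, for each feature, how often it is filled in for each outcome. The sketch below is a heuristic under stated assumptions: the rows are plain dictionaries, missing values are stored as None, and the 'age' and 'hospitalized' field names are hypothetical. Only 'discharge_date' comes from the example above.

```python
def fill_rate_by_label(rows, feature, label_key):
    """Return {label value: fraction of rows where the feature is filled}."""
    totals, filled = {}, {}
    for row in rows:
        y = row[label_key]
        totals[y] = totals.get(y, 0) + 1
        if row.get(feature) is not None:
            filled[y] = filled.get(y, 0) + 1
    return {y: filled.get(y, 0) / totals[y] for y in totals}

# Hypothetical patient records; None marks a missing value.
patients = [
    {"age": 71, "discharge_date": "2024-03-02", "hospitalized": 1},
    {"age": 34, "discharge_date": None, "hospitalized": 0},
    {"age": 58, "discharge_date": "2024-03-09", "hospitalized": 1},
    {"age": 45, "discharge_date": None, "hospitalized": 0},
]

discharge_rates = fill_rate_by_label(patients, "discharge_date", "hospitalized")
age_rates = fill_rate_by_label(patients, "age", "hospitalized")

print("discharge_date:", discharge_rates)
print("age:", age_rates)
```

Here 'discharge_date' is filled for every hospitalized patient and for no one else, so merely knowing whether the field is present reveals the label. A gap that large is a reason to ask when the field gets recorded, not proof of leakage by itself.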

Common causes of leakage:
Target leakage: a feature is a stand-in for the answer or is only known after the outcome occurs.
Train/test contamination: the same record, or a duplicate of it, appears in both the training and test sets.
Preprocessing leakage: scaling or filling missing values using statistics computed over the whole dataset before splitting, so test information bleeds into training.
Temporal leakage: in time-series data, training on information from after the moment being predicted, for example by shuffling records randomly across time before splitting.
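Two of these causes lend themselves to quick mechanical checks. The sketch below shows one possible way to find test records that also appear in training and to split time-stamped records by a cutoff date so that training data always precedes test data. The helper names, the ticket texts, and the 'day' field are all hypothetical.

```python
def overlapping_records(train, test, key=lambda r: r):
    """Return the test records whose key also occurs in the training set."""
    train_keys = {key(r) for r in train}
    return [r for r in test if key(r) in train_keys]

def time_based_split(rows, time_key, cutoff):
    """Train on everything before the cutoff, test on the cutoff and after."""
    train = [r for r in rows if r[time_key] < cutoff]
    test = [r for r in rows if r[time_key] >= cutoff]
    return train, test

# Contamination check: normalize text so near-identical duplicates match.
train_tickets = ["Refund not received", "Cannot log in"]
test_tickets = ["refund not received ", "App crashes on start"]
overlap = overlapping_records(
    train_tickets, test_tickets, key=lambda s: s.strip().lower()
)

# Temporal split: ISO dates (YYYY-MM-DD) sort correctly as plain strings.
events = [
    {"day": "2024-01-05", "value": 10},
    {"day": "2024-03-20", "value": 12},
    {"day": "2024-02-11", "value": 9},
    {"day": "2024-04-02", "value": 15},
]
past, future = time_based_split(events, "day", cutoff="2024-03-01")

print("duplicates found in test:", overlap)
print("train days:", [e["day"] for e in past])
print("test days:", [e["day"] for e in future])
```

The normalization key matters: exact matching would miss the duplicate ticket here because of the differences in case and trailing whitespace. What counts as "the same record" depends on the dataset, so the key is a judgment call.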

Defense against leakage: split first, then preprocess using only the training set's statistics, and ask of every feature, 'Would this value really be available at the moment of prediction?'
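The "split first, then preprocess" rule can be illustrated with mean imputation. In this minimal sketch the fill value is computed from the training rows only and then reused unchanged on the test rows. The income figures and the helper names fit_mean and impute are invented for illustration.

```python
from statistics import mean

def fit_mean(values):
    """Compute the fill value from observed (non-missing) values only."""
    return mean(v for v in values if v is not None)

def impute(values, fill):
    """Replace missing values with a fill value computed elsewhere."""
    return [fill if v is None else v for v in values]

# The data has already been split; None marks a missing value.
train_income = [40.0, 60.0, None, 50.0]
test_income = [None, 500.0]

# Correct: the statistic comes from the training set alone.
fill = fit_mean(train_income)
train_filled = impute(train_income, fill)
test_filled = impute(test_income, fill)

# Leaky: the statistic is computed over train and test together,
# so the test set's unusual value shifts what the model trains on.
leaky_fill = fit_mean(train_income + test_income)

print("train-only fill:", fill)
print("leaky fill:", leaky_fill)
```

The same fit-on-train, apply-everywhere pattern holds for scaling, encoding, and any other step that learns a statistic from data.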

Knowledge check

Quick practice — not part of your exam score.

Which split should be used only once, at the very end, to estimate real-world performance?

A churn model includes the feature 'cancellation_processed_date,' which only gets filled after a customer has already churned. Test accuracy is near-perfect but production performance is terrible. What happened?

Why is it best practice to split data into train/validation/test BEFORE computing preprocessing statistics like averages for filling missing values?
