Module 6 · The Model Lifecycle: Train, Evaluate, Deploy, Monitor
Measuring Performance: Metrics That Matter
70 min
Learning objectives
- Explain why accuracy alone can be dangerously misleading
- Define precision, recall, and F1 and read a confusion matrix intuitively
- Choose appropriate metrics based on the costs of false positives versus false negatives
The trap of accuracy
Accuracy — the share of predictions that are correct — feels like the obvious score. But it can hide total failure when one outcome is rare. The metric you choose shapes which mistakes you tolerate, so picking it is a real decision, not a formality.
Example — Why 99% accuracy can be terrible
Imagine a disease that affects 1 in 100 people. A 'model' that simply predicts 'healthy' for everyone is 99% accurate — it's right on the 99 healthy people and wrong only on the 1 sick person. Yet it catches zero actual cases. It is 99% accurate and 100% useless for its purpose.
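The arithmetic in this example can be checked in a few lines. This is a minimal sketch, assuming a made-up population of 100 people with exactly 1 sick person; the variable names are illustrative only.

```python
# Hypothetical population matching the example: 1 sick person in 100.
actual = ["sick"] * 1 + ["healthy"] * 99

# A "model" that ignores its input and always predicts "healthy".
predicted = ["healthy"] * len(actual)

# Accuracy: share of predictions that match the truth.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Share of the truly sick people that the model actually flagged.
sick_total = sum(a == "sick" for a in actual)
sick_caught = sum(a == "sick" and p == "sick" for a, p in zip(actual, predicted))
share_of_sick_caught = sick_caught / sick_total

print(f"accuracy: {accuracy:.2f}")                          # accuracy: 0.99
print(f"share of sick caught: {share_of_sick_caught:.2f}")  # share of sick caught: 0.00
```

The accuracy figure looks excellent, while the second number shows the model never finds the one person it exists to find.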
Watch out
Whenever classes are imbalanced — fraud, rare disease, defects, spam — accuracy alone is misleading. Always ask how the model does specifically on the rare, important class.
Reading a confusion matrix
A confusion matrix sorts every prediction into one of four buckets by comparing what the model said against what was actually true. Once you can read it, precision and recall become intuitive.
| | Actually Positive | Actually Negative |
|---|---|---|
| Model says Positive | True Positive (correct catch) | False Positive (false alarm) |
| Model says Negative | False Negative (missed it) | True Negative (correct pass) |
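To make the four buckets concrete, here is a small sketch that sorts predictions into them. The labels are made up for illustration, and `confusion_counts` is a hypothetical helper name, not a library function.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Tally TP, FP, FN, TN for boolean labels (True = positive)."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p and a:
            counts["TP"] += 1      # correct catch
        elif p and not a:
            counts["FP"] += 1      # false alarm
        elif not p and a:
            counts["FN"] += 1      # missed it
        else:
            counts["TN"] += 1      # correct pass
    return counts

# Made-up data: 3 actual positives, 5 actual negatives.
actual    = [True, True, True,  False, False, False, False, False]
predicted = [True, True, False, True,  False, False, False, False]

counts = confusion_counts(actual, predicted)
print(dict(counts))  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 4}
```

Every prediction lands in exactly one bucket, so the four counts always add up to the number of examples.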
Precision — Of everything the model flagged as positive, how much was truly positive: TP / (TP + FP). It answers 'when the model raises an alarm, how often is it right?'
Recall — Of everything that was truly positive, how much the model caught: TP / (TP + FN). It answers 'of all the real cases, how many did we find?'
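Both formulas translate directly into code. The sketch below uses hypothetical counts (30 true positives, 10 false positives, 20 false negatives); returning 0.0 when the denominator is zero is a convention assumed here, not something the definitions themselves dictate.

```python
def precision(tp, fp):
    """TP / (TP + FP): when the model raises an alarm, how often is it right?"""
    flagged = tp + fp
    # Assumed convention: report 0.0 if the model flagged nothing at all.
    return tp / flagged if flagged else 0.0

def recall(tp, fn):
    """TP / (TP + FN): of all the real cases, how many did we find?"""
    actual_positives = tp + fn
    # Assumed convention: report 0.0 if there were no real cases.
    return tp / actual_positives if actual_positives else 0.0

# Hypothetical counts read off a confusion matrix.
tp, fp, fn = 30, 10, 20

print(precision(tp, fp))  # 0.75
print(recall(tp, fn))     # 0.6
```

Note that true negatives appear in neither formula, which is why both metrics stay informative when the negative class is huge.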
Analogy
Think of a fishing net. Precision is how much of your catch is the fish you wanted (versus seaweed and boots). Recall is how many of the fish in the lake you actually caught. A bigger net catches more fish (recall up) but hauls in more junk (precision down) — there's a trade-off.
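One common way to "widen the net" is to lower the score at which a model flags something as positive. The sketch below assumes a model that outputs a score between 0 and 1 for each item; the scores, labels, and thresholds are all made up to illustrate the trade-off.

```python
def precision_recall_at(threshold, scores, labels):
    """Flag every item scoring at or above the threshold, then compute precision and recall."""
    flagged = [s >= threshold for s in scores]
    tp = sum(f and y for f, y in zip(flagged, labels))
    fp = sum(f and not y for f, y in zip(flagged, labels))
    fn = sum((not f) and y for f, y in zip(flagged, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up model scores and true labels (True = the fish we wanted).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, True, False, False, True, False, False]

narrow = precision_recall_at(0.75, scores, labels)  # small net
wide = precision_recall_at(0.25, scores, labels)    # big net

print(f"narrow net: precision={narrow[0]:.2f}, recall={narrow[1]:.2f}")
# narrow net: precision=1.00, recall=0.75
print(f"wide net:   precision={wide[0]:.2f}, recall={wide[1]:.2f}")
# wide net:   precision=0.57, recall=1.00
```

With this data, the wider net finds every real positive but nearly half of what it hauls in is junk.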
Comparing the metrics
| Metric | What it measures | Use when |
|---|---|---|
| Accuracy | Overall fraction correct | Classes are roughly balanced and all errors cost the same |
| Precision | Share of flagged items that are truly positive | False positives are costly (e.g., flagging good transactions as fraud) |
| Recall | Share of actual positives the model catches | False negatives are costly (e.g., missing a cancer or a fraud case) |
| F1 Score | Balance of precision and recall | You care about both false alarms and misses and want one number |
F1 Score — The harmonic mean of precision and recall. It stays low unless both are reasonably high, so it punishes a model that wins on one by sacrificing the other.
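Written out, the harmonic mean is F1 = 2 × precision × recall / (precision + recall). The sketch below compares it with a plain average for two hypothetical models; the precision and recall values are invented for illustration, and returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)

# Hypothetical model A: solid on both metrics.
balanced = f1_score(0.8, 0.8)

# Hypothetical model B: perfect precision bought by sacrificing recall.
lopsided = f1_score(1.0, 0.1)
plain_average = (1.0 + 0.1) / 2

print(f"balanced F1:   {balanced:.2f}")       # balanced F1:   0.80
print(f"lopsided F1:   {lopsided:.2f}")       # lopsided F1:   0.18
print(f"plain average: {plain_average:.2f}")  # plain average: 0.55
```

A plain average would rate the lopsided model at 0.55, while F1 drags it down to about 0.18, close to its weaker score.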
Choose metrics by asking which mistake is worse. Missing a tumor (false negative) is far worse than a false alarm, so screening favors recall. Blocking a legitimate customer's card (false positive) is costly, so some fraud checks favor precision.
There is no universally 'best' metric. The right choice falls out of the problem: what does each kind of error cost the people affected? A good practitioner can name that trade-off out loud before training begins.
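One way to name the trade-off out loud is to put rough numbers on each kind of error and see which model comes out cheaper. This is only a sketch: the error counts and the cost figures are entirely hypothetical, and real costs are rarely this easy to quantify.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes, given a price per false positive and per false negative."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical error counts for two models evaluated on the same data.
model_a = {"fp": 50, "fn": 5}   # many false alarms, few misses (recall-leaning)
model_b = {"fp": 5, "fn": 20}   # few false alarms, more misses (precision-leaning)

# Hypothetical screening setting: a miss is 100 times worse than a false alarm.
screening = {"cost_fp": 1, "cost_fn": 100}
# Hypothetical setting where false alarms hurt more than misses.
friction = {"cost_fp": 10, "cost_fn": 5}

screening_a = total_cost(**model_a, **screening)  # 50*1 + 5*100 = 550
screening_b = total_cost(**model_b, **screening)  # 5*1 + 20*100 = 2005
friction_a = total_cost(**model_a, **friction)    # 50*10 + 5*5 = 525
friction_b = total_cost(**model_b, **friction)    # 5*10 + 20*5 = 150

print("screening:", screening_a, screening_b)  # screening: 550 2005
print("friction: ", friction_a, friction_b)    # friction:  525 150
```

The same two models swap places depending on the costs assumed, which is the point: the metric follows from the problem, not the other way around.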
Knowledge check
Quick practice — not part of your exam score.
A disease affects 1% of patients. A model predicts 'no disease' for everyone and reports 99% accuracy. What is the problem?
For an early cancer screening tool, missing a real case is far worse than a false alarm. Which metric should the team prioritize?
What does the F1 score capture that looking at accuracy alone does not?