Module 6 · The Model Lifecycle: Train, Evaluate, Deploy, Monitor

Measuring Performance: Metrics That Matter

70 min

Learning objectives

Explain why accuracy alone can be dangerously misleading
Define precision, recall, and F1 and read a confusion matrix intuitively
Choose appropriate metrics based on the costs of false positives versus false negatives

The trap of accuracy

Accuracy — the share of predictions that are correct — feels like the obvious score. But it can hide total failure when one outcome is rare. The metric you choose shapes which mistakes you tolerate, so picking it is a real decision, not a formality.

Example — Why 99% accuracy can be terrible

Imagine a disease that affects 1 in 100 people, and picture a group of 100 people in which exactly 1 is sick. A 'model' that simply predicts 'healthy' for everyone is 99% accurate: it is right about the 99 healthy people and wrong only about the 1 sick person. Yet it catches zero actual cases. It is 99% accurate and 100% useless for its purpose.
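The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch that assumes a made-up group of exactly 100 people with 1 sick person; the variable names are illustrative only.

```python
# Hypothetical group of 100 people: 1 sick (label 1), 99 healthy (label 0).
actual = [1] + [0] * 99

# The "model" from the example: predict healthy (0) for everyone.
predicted = [0] * len(actual)

# Accuracy is the share of predictions that match the truth.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

# How many of the sick people did the model actually flag?
sick_total = sum(actual)
sick_caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(f"accuracy: {accuracy:.0%}")                            # accuracy: 99%
print(f"sick people caught: {sick_caught} of {sick_total}")   # 0 of 1
```

The accuracy figure looks excellent, while the second line shows the model never finds the one person it exists to find.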

Watch out

Whenever classes are imbalanced — fraud, rare disease, defects, spam — accuracy alone is misleading. Always ask how the model does specifically on the rare, important class.

Reading a confusion matrix

A confusion matrix sorts every prediction into one of four buckets by comparing what the model said against what was actually true. Once you can read it, precision and recall become intuitive.

	Actually Positive	Actually Negative
Model says Positive	True Positive (correct catch)	False Positive (false alarm)
Model says Negative	False Negative (missed it)	True Negative (correct pass)
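As a rough illustration of how the four buckets get filled, the sketch below tallies them from two lists of labels. The function name confusion_counts and the eight example cases are assumptions made up for this illustration, not part of any particular library.

```python
def confusion_counts(actual, predicted):
    """Tally the four confusion-matrix buckets for binary labels (1 = positive)."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["TP"] += 1   # correct catch
        elif p == 1 and a == 0:
            counts["FP"] += 1   # false alarm
        elif p == 0 and a == 1:
            counts["FN"] += 1   # missed it
        else:
            counts["TN"] += 1   # correct pass
    return counts

# Made-up labels for eight cases, purely for illustration.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 4}
```

Every case lands in exactly one bucket, so the four counts always add up to the number of predictions.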

Precision — Of everything the model flagged as positive, how much was truly positive: TP / (TP + FP). It answers 'when the model raises an alarm, how often is it right?'

Recall — Of everything that was truly positive, how much the model caught: TP / (TP + FN). It answers 'of all the real cases, how many did we find?'
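The two formulas above translate directly into code. This sketch uses hypothetical counts (30 true positives, 10 false positives, 20 false negatives). Returning None when a denominator is zero is a design choice made here for illustration, since the ratio is undefined in that case.

```python
def precision(tp, fp):
    """TP / (TP + FP): of everything flagged, the share that was truly positive."""
    flagged = tp + fp
    return tp / flagged if flagged else None  # undefined if nothing was flagged

def recall(tp, fn):
    """TP / (TP + FN): of everything truly positive, the share that was caught."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else None  # undefined if no real positives

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
tp, fp, fn = 30, 10, 20

print(precision(tp, fp))  # 0.75 -> 3 out of 4 alarms were right
print(recall(tp, fn))     # 0.6  -> 3 out of 5 real cases were found
```

Note that true negatives appear in neither formula, which is why both metrics stay informative when the negative class is huge.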

Analogy

Think of a fishing net. Precision is how much of your catch is the fish you wanted (versus seaweed and boots). Recall is how many of the fish in the lake you actually caught. A bigger net catches more fish (recall up) but hauls in more junk (precision down) — there's a trade-off.
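One common way the "size of the net" shows up in practice is a decision threshold on a model's score: lowering the threshold flags more cases. The sketch below assumes a hypothetical model that outputs a score between 0 and 1; the scores, labels, and thresholds are invented for illustration, and real data will not always move this cleanly.

```python
def precision_recall_at(threshold, scores, labels):
    """Flag every case scoring at or above the threshold, then measure the result."""
    flagged = [s >= threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    fp = sum(1 for f, y in zip(flagged, labels) if f and y == 0)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y == 1)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall

# Invented scores and true labels (1 = positive) for eight cases.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

# A lower threshold is a bigger net.
for threshold in (0.85, 0.50, 0.25):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold {threshold:.2f}: precision {p:.2f}, recall {r:.2f}")
```

On this toy data, recall climbs from 0.50 to 0.75 to 1.00 as the threshold drops, while precision falls from 1.00 to 0.60 to about 0.57.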

Comparing the metrics

Metric	What it measures	Use when
Accuracy	Overall fraction correct	Classes are roughly balanced and all errors cost the same
Precision	Share of flagged cases that are truly positive (few false alarms)	False positives are costly (e.g., flagging good transactions as fraud)
Recall	Share of truly positive cases that get caught (few misses)	False negatives are costly (e.g., missing a cancer or a fraud case)
F1 Score	Balance of precision and recall	You care about both false alarms and misses and want one number

F1 Score — The harmonic mean of precision and recall. It stays low unless both are reasonably high, so it punishes a model that wins on one by sacrificing the other.
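Written out, the harmonic mean is F1 = 2 × precision × recall / (precision + recall). The short sketch below compares it with a plain average on two hypothetical models; the precision and recall values are made up, and returning 0.0 when both inputs are zero is a convention assumed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention assumed here: no useful predictions at all
    return 2 * precision * recall / (precision + recall)

# Two hypothetical models with made-up scores.
balanced = f1_score(0.8, 0.8)   # decent on both
lopsided = f1_score(1.0, 0.1)   # perfect precision, very poor recall

print(round(balanced, 2))        # 0.8
print(round(lopsided, 2))        # 0.18
print(round((1.0 + 0.1) / 2, 2)) # 0.55, the plain average of the lopsided model
```

The plain average of the lopsided model looks passable at 0.55, while F1 drops to about 0.18, which is the "punishment" described above.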

Choose metrics by asking which mistake is worse. Missing a tumor (false negative) is far worse than a false alarm, so screening favors recall. Blocking a legitimate customer's card (false positive) is costly, so some fraud checks favor precision.
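One way to make "which mistake is worse" concrete is to put a rough cost on each kind of error and total it up. Everything in this sketch is an assumption for illustration: the two models, their error counts, and the cost figures are invented, and real costs are rarely this easy to pin down.

```python
def total_error_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of a model's mistakes, given a cost per false positive and per false negative."""
    return fp * cost_fp + fn * cost_fn

# Two hypothetical models evaluated on the same data.
cautious = {"fp": 40, "fn": 2}    # flags a lot, misses little (favors recall)
selective = {"fp": 5, "fn": 15}   # flags little, misses more (favors precision)

# Screening-like setting (made-up costs): a miss is 50 times worse than a false alarm.
screening = {name: total_error_cost(m["fp"], m["fn"], cost_fp=1, cost_fn=50)
             for name, m in (("cautious", cautious), ("selective", selective))}

# Card-blocking-like setting (made-up costs): a false alarm is 10 times worse than a miss.
blocking = {name: total_error_cost(m["fp"], m["fn"], cost_fp=10, cost_fn=1)
            for name, m in (("cautious", cautious), ("selective", selective))}

print(screening)  # {'cautious': 140, 'selective': 755}
print(blocking)   # {'cautious': 402, 'selective': 65}
```

The same two models swap places depending on the costs: the recall-heavy model is cheaper when misses dominate, and the precision-heavy model is cheaper when false alarms dominate.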

There is no universally 'best' metric. The right choice falls out of the problem: what does each kind of error cost the people affected? A good practitioner can name that trade-off out loud before training begins.

Knowledge check

Quick practice — not part of your exam score.

A disease affects 1% of patients. A model predicts 'no disease' for everyone and reports 99% accuracy. What is the problem?

For an early cancer screening tool, missing a real case is far worse than a false alarm. Which metric should the team prioritize?

What does the F1 score capture that looking at accuracy alone does not?
