Certified AI Practitioner

Module 6 · The Model Lifecycle: Train, Evaluate, Deploy, Monitor

Measuring Performance: Metrics That Matter

70 min

Learning objectives

  • Explain why accuracy alone can be dangerously misleading
  • Define precision, recall, and F1 and read a confusion matrix intuitively
  • Choose appropriate metrics based on the costs of false positives versus false negatives

The trap of accuracy

Accuracy — the share of predictions that are correct — feels like the obvious score. But it can hide total failure when one outcome is rare. The metric you choose shapes which mistakes you tolerate, so picking it is a real decision, not a formality.

Example — Why 99% accuracy can be terrible

Imagine a disease that affects 1 in 100 people. Take a group of 100 people: a 'model' that simply predicts 'healthy' for everyone is right on the 99 healthy people and wrong only on the 1 sick person, so it is 99% accurate. Yet it catches zero actual cases. It is 99% accurate and 100% useless for its purpose.
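The arithmetic in this example can be checked in a few lines. This is a minimal sketch, assuming a made-up group of 100 people with labels 1 = sick and 0 = healthy; the variable names are illustrative only.

```python
# Hypothetical group of 100 people: 1 sick (label 1), 99 healthy (label 0).
actual = [1] + [0] * 99

# The "model" from the example: predict healthy (0) for everyone.
predicted = [0] * len(actual)

# Accuracy: share of predictions that match the truth.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)

# Share of the real sick cases that the model caught.
caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
sick_cases_found = caught / sum(actual)

print(f"accuracy         = {accuracy:.2f}")          # 0.99
print(f"sick cases found = {sick_cases_found:.2f}")  # 0.00
```

The accuracy looks excellent, while the share of sick people found is zero.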

Watch out

Whenever classes are imbalanced — fraud, rare disease, defects, spam — accuracy alone is misleading. Always ask how the model does specifically on the rare, important class.

Reading a confusion matrix

A confusion matrix sorts every prediction into one of four buckets by comparing what the model said against what was actually true. Once you can read it, precision and recall become intuitive.

                     | Actually Positive              | Actually Negative
Model says Positive  | True Positive (correct catch)  | False Positive (false alarm)
Model says Negative  | False Negative (missed it)     | True Negative (correct pass)

Precision: Of everything the model flagged as positive, how much was truly positive: TP / (TP + FP). It answers 'when the model raises an alarm, how often is it right?'

Recall: Of everything that was truly positive, how much the model caught: TP / (TP + FN). It answers 'of all the real cases, how many did we find?'
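To make the four buckets and the two formulas concrete, here is a small sketch in plain Python. The labels are invented for illustration, the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not a rule from the lesson.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN, assuming binary labels (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1   # correct catch
        elif p == 1 and a == 0:
            fp += 1   # false alarm
        elif p == 0 and a == 1:
            fn += 1   # missed it
        else:
            tn += 1   # correct pass
    return tp, fp, fn, tn

def precision(tp, fp):
    # TP / (TP + FP); 0.0 if the model flagged nothing (assumed convention).
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # TP / (TP + FN); 0.0 if there were no real positives (assumed convention).
    return tp / (tp + fn) if (tp + fn) else 0.0

# Made-up example: 4 real positives, 6 real negatives.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)                            # 3 2 1 4
print(f"precision = {precision(tp, fp):.2f}")    # 0.60
print(f"recall    = {recall(tp, fn):.2f}")       # 0.75
```

Here the model raised 5 alarms and 3 were right (precision 0.60), and it found 3 of the 4 real cases (recall 0.75).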

Analogy

Think of a fishing net. Precision is how much of your catch is the fish you wanted (versus seaweed and boots). Recall is how many of the fish in the lake you actually caught. A bigger net catches more fish (recall up) but hauls in more junk (precision down) — there's a trade-off.
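One way to see the 'bigger net' effect is to lower the score threshold at which a model flags a case as positive. The sketch below assumes a model that outputs a score between 0 and 1. The scores and labels are invented, and `precision_recall_at` is a hypothetical helper, not a library function.

```python
# Hypothetical (score, true label) pairs; higher score = model is more sure it is positive.
scored = [
    (0.95, 1), (0.90, 1), (0.80, 0), (0.70, 1), (0.60, 0),
    (0.50, 1), (0.40, 0), (0.30, 0), (0.20, 1), (0.10, 0),
]

def precision_recall_at(threshold, scored):
    """Flag everything with score >= threshold, then compute precision and recall."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# A lower threshold is a bigger net: more cases get flagged.
for threshold in (0.85, 0.45, 0.15):
    p, r = precision_recall_at(threshold, scored)
    print(f"threshold {threshold:.2f}: precision = {p:.2f}, recall = {r:.2f}")
# threshold 0.85: precision = 1.00, recall = 0.40
# threshold 0.45: precision = 0.67, recall = 0.80
# threshold 0.15: precision = 0.56, recall = 1.00
```

In this made-up data, each widening of the net raises recall and lowers precision. On real data the pattern is typical, though precision does not have to fall at every single step.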

Comparing the metrics

Metric     | What it measures                | Use when
Accuracy   | Overall fraction correct        | Classes are roughly balanced and all errors cost the same
Precision  | Few false alarms                | False positives are costly (e.g., flagging good transactions as fraud)
Recall     | Few misses                      | False negatives are costly (e.g., missing a cancer or a fraud case)
F1 Score   | Balance of precision and recall | You care about both false alarms and misses and want one number

F1 Score: The harmonic mean of precision and recall. It stays low unless both are reasonably high, so it punishes a model that wins on one by sacrificing the other.
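A short sketch shows why the harmonic mean behaves this way. The formula is the standard one, F1 = 2 × precision × recall / (precision + recall). The two example models and their precision and recall values are invented for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero (assumed convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two hypothetical models with the same simple average (0.55) of precision and recall.
lopsided = (1.0, 0.1)    # perfect precision, very poor recall
balanced = (0.55, 0.55)  # moderate on both

for name, (p, r) in (("lopsided", lopsided), ("balanced", balanced)):
    simple_average = (p + r) / 2
    print(f"{name}: simple average = {simple_average:.2f}, F1 = {f1_score(p, r):.2f}")
# lopsided: simple average = 0.55, F1 = 0.18
# balanced: simple average = 0.55, F1 = 0.55
```

A simple average rates both models the same, while F1 drops sharply for the model that sacrificed recall.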

Choose metrics by asking which mistake is worse. Missing a tumor (false negative) is far worse than a false alarm, so screening favors recall. Blocking a legitimate customer's card (false positive) is costly, so some fraud checks favor precision.

There is no universally 'best' metric. The right choice falls out of the problem: what does each kind of error cost the people affected? A good practitioner can name that trade-off out loud before training begins.
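Naming the trade-off can be as simple as attaching a rough cost to each kind of error and comparing totals. The sketch below is illustrative only: the two models, their error counts, and the cost figures are all invented, and `total_cost` is a hypothetical helper.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of errors, given a per-error cost for false positives and false negatives."""
    return fp * cost_fp + fn * cost_fn

# Two hypothetical models evaluated on the same data.
models = {
    "wide net":   {"fp": 40, "fn": 2},   # many false alarms, few misses (high recall)
    "narrow net": {"fp": 5,  "fn": 15},  # few false alarms, more misses (high precision)
}

# Assumed cost settings in arbitrary units, not real figures.
scenarios = {
    "screening (a miss is far worse)":           {"cost_fp": 1,  "cost_fn": 50},
    "card blocking (a false alarm costs more)":  {"cost_fp": 10, "cost_fn": 1},
}

best = {}
for scenario, costs in scenarios.items():
    totals = {name: total_cost(m["fp"], m["fn"], **costs) for name, m in models.items()}
    best[scenario] = min(totals, key=totals.get)
    print(scenario, totals, "-> cheaper:", best[scenario])
```

With these assumed costs, the wide net is cheaper for screening (140 versus 755) and the narrow net is cheaper for card blocking (65 versus 402). The same two models rank differently once the cost of each error changes.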

Knowledge check

Quick practice — not part of your exam score.

A disease affects 1% of patients. A model predicts 'no disease' for everyone and reports 99% accuracy. What is the problem?

For an early cancer screening tool, missing a real case is far worse than a false alarm. Which metric should the team prioritize?

What does the F1 score capture that looking at accuracy alone does not?
