Module 5 · Data — The Fuel of AI

Data Types, Sources & Quality

65 min

Learning objectives

Distinguish structured from unstructured data and identify common sources of each
Explain the principle of 'garbage in, garbage out' and how data quality bounds model quality
Assess a dataset for representativeness, completeness, and bias before training

Data is the raw material

Every machine-learning model learns its behavior from data. The architecture and compute matter, but the data is what actually teaches the model what 'right' looks like. A practitioner who can read code but cannot judge data quality is missing the more decisive skill.

Analogy

Think of data as the ingredients in a kitchen. A brilliant chef (the algorithm) cannot rescue a dish made from spoiled, mislabeled, or missing ingredients. Great cooking starts with good groceries — great models start with good data.

Structured vs. unstructured

Data comes in two broad shapes. Structured data fits neatly into rows and columns with named fields: think of a sales table with date, amount, and region. Unstructured data has no fixed schema: emails, product photos, call recordings, scanned PDFs. Most of the world's data is commonly estimated to be unstructured, which is why modern AI's strength with images and language has been so valuable.

Aspect	Structured	Unstructured
Format	Rows and columns, fixed fields	Free text, images, audio, video
Examples	Transactions, sensor logs, CRM records	Emails, photos, call recordings, contracts
Typical store	Relational database / spreadsheet	File store, data lake, object storage
Ease of use	Ready for analysis quickly	Needs processing/feature extraction first

Structured data — Data organized in a predefined row-and-column schema, such as a database table or spreadsheet.

Unstructured data — Data with no predefined schema — text, images, audio, video — that must be processed before a model can learn from it.
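To make the two definitions concrete, here is a minimal sketch in Python. The sales rows, the email text, and the helper `extract_amount` are all hypothetical, made up for illustration. The point is that a structured table can be queried by field name right away, while free text needs an extraction step first.

```python
import re

# Structured: every record has the same named fields, so it can be queried directly.
sales_rows = [
    {"date": "2024-03-01", "amount": 120.0, "region": "North"},
    {"date": "2024-03-02", "amount": 80.5, "region": "South"},
    {"date": "2024-03-02", "amount": 42.0, "region": "North"},
]
north_total = sum(row["amount"] for row in sales_rows if row["region"] == "North")

# Unstructured: free text with no schema; the field must be extracted before use.
email_body = "Hi team, the customer in the North region paid $162.00 on March 2. Thanks!"

def extract_amount(text):
    """Return the first dollar amount found in the text, or None (hypothetical helper)."""
    match = re.search(r"\$(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None

print(north_total)                 # 162.0
print(extract_amount(email_body))  # 162.0
```

Real unstructured data rarely yields to a single regular expression; the sketch only illustrates why an extra processing step sits between the raw data and the model.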

Garbage in, garbage out

The oldest rule in computing applies doubly to AI: if you train on flawed data, you get a flawed model, and it will make its mistakes with confidence. Models do not correct bad inputs; they faithfully learn whatever patterns exist in the data, including the errors, gaps, and biases.

A model's quality is capped by its data's quality. No amount of tuning fixes a dataset that is wrong, incomplete, or unrepresentative.
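The cap can be seen even in a toy setting. The sketch below uses a deliberately tiny "model" that memorizes the most common label for each input; it is an illustrative assumption, not a real learning algorithm, and the document types and labels are invented. Trained on systematically mislabeled data, it reproduces the mislabeling exactly.

```python
from collections import Counter, defaultdict

def train_lookup_model(examples):
    """Learn the most common label for each input value (a deliberately tiny 'model')."""
    counts = defaultdict(Counter)
    for feature, label in examples:
        counts[feature][label] += 1
    return {feature: c.most_common(1)[0][0] for feature, c in counts.items()}

def accuracy(model, examples):
    """Fraction of examples whose label the model predicts correctly."""
    correct = sum(1 for feature, label in examples if model.get(feature) == label)
    return correct / len(examples)

# Ground truth for this toy task: invoices are 'finance', resumes are 'hr'.
clean_train = [("invoice", "finance")] * 10 + [("resume", "hr")] * 10
# Flawed copy: most resumes were mislabeled as 'finance' during annotation.
flawed_train = (
    [("invoice", "finance")] * 10 + [("resume", "finance")] * 7 + [("resume", "hr")] * 3
)
test_set = [("invoice", "finance")] * 5 + [("resume", "hr")] * 5

clean_model = train_lookup_model(clean_train)
flawed_model = train_lookup_model(flawed_train)

print(accuracy(clean_model, test_set))   # 1.0
print(accuracy(flawed_model, test_set))  # 0.5
print(flawed_model["resume"])            # finance
```

The training procedure is identical in both runs; only the labels differ. No change to the procedure recovers the correct answer for resumes, because the correct answer is the minority in the flawed data.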

Example — Representativeness gone wrong

A hospital trains a skin-condition classifier almost entirely on images of light skin. In the clinic it misses conditions on darker skin because those cases were rare or absent in training. The algorithm was fine; the data did not represent the real population the model would serve.
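A failure like this is easy to miss when only a single overall accuracy is reported. The sketch below uses invented evaluation counts (not real clinical results) to show how breaking accuracy down by group can surface the problem; the group names and the function `accuracy_by_group` are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical evaluation results: (skin_tone_group, prediction_was_correct)
results = (
    [("light", True)] * 90 + [("light", False)] * 5
    + [("dark", True)] * 2 + [("dark", False)] * 3
)

def accuracy_by_group(records):
    """Return accuracy computed separately for each group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {group: correct[group] / totals[group] for group in totals}

overall = sum(ok for _, ok in results) / len(results)
per_group = accuracy_by_group(results)

print(overall)             # 0.92
print(per_group["light"])  # roughly 0.95
print(per_group["dark"])   # 0.4
```

With these made-up numbers, the headline figure looks healthy because the well-represented group dominates the average, while the under-represented group is served poorly.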

Judging data quality

Accuracy — are the values and labels actually correct?
Completeness — how many fields or records are missing, and is the missingness random or systematic?
Consistency — are units, formats, and definitions the same across sources?
Representativeness — does the data reflect the population and conditions the model will face in production?
Timeliness — is the data current enough that the patterns still hold?
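Some of these checks can be roughed out in a few lines. The sketch below covers completeness, consistency, and timeliness on a small invented dataset; the field names, the one-year staleness threshold, and the helper functions are all assumptions. Accuracy and representativeness are left out because they need an external reference (trusted labels, or known population figures) to compare against.

```python
from datetime import date

# Hypothetical records merged from two sources.
records = [
    {"id": 1, "weight": 70.0, "unit": "kg", "label": "ok", "updated": date(2024, 5, 1)},
    {"id": 2, "weight": 154.0, "unit": "lb", "label": "ok", "updated": date(2024, 4, 20)},
    {"id": 3, "weight": None, "unit": "kg", "label": None, "updated": date(2021, 1, 15)},
    {"id": 4, "weight": 82.5, "unit": "kg", "label": "faulty", "updated": date(2024, 5, 3)},
]

def missing_rate(rows, field):
    """Completeness: fraction of rows where the field is missing."""
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

def distinct_values(rows, field):
    """Consistency: more than one unit or format in a field is a warning sign."""
    return {r[field] for r in rows if r.get(field) is not None}

def stale_ids(rows, as_of, max_age_days):
    """Timeliness: ids of rows last updated more than max_age_days before as_of."""
    return [r["id"] for r in rows if (as_of - r["updated"]).days > max_age_days]

print(missing_rate(records, "weight"))             # 0.25
print(sorted(distinct_values(records, "unit")))    # ['kg', 'lb']
print(stale_ids(records, date(2024, 6, 1), 365))   # [3]
```

A missing rate alone does not say whether the gaps are random or systematic; a reasonable next step is to compute the same rate per source or per group and see whether it differs.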

Watch out

Beware sampling bias: data collected from a convenient or narrow source (one region, one time window, one channel) can look large and clean yet still misrepresent reality. Size is not the same as representativeness.
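One simple way to look for sampling bias is to compare each group's share of the dataset with its share of the population the model will serve. The sketch below assumes such reference shares are available; the regions, counts, and shares are invented, and `share_gaps` is a hypothetical helper.

```python
from collections import Counter

def share_gaps(sample_groups, population_shares):
    """Return sample share minus population share for each group."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in population_shares.items()
    }

# Hypothetical: 10,000 records, but collected mostly through one region.
sample = ["north"] * 8500 + ["south"] * 1000 + ["east"] * 500
population = {"north": 0.40, "south": 0.35, "east": 0.25}  # assumed reference shares

gaps = share_gaps(sample, population)
for group, gap in gaps.items():
    print(f"{group}: {gap:+.2f}")
# north: +0.45
# south: -0.25
# east: -0.20
```

Ten thousand rows sounds like plenty, yet in this example one region is over-represented by 45 percentage points. What counts as an acceptable gap depends on the application and is not something this sketch decides.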

Knowledge check

Quick practice — not part of your exam score.

Which of the following is the best example of unstructured data?

A team builds a high-accuracy hiring model but later finds it performs poorly on applicants from regions barely present in the training data. This is primarily a failure of:

What does the principle 'garbage in, garbage out' mean for machine learning?
