Module 5 · Data — The Fuel of AI
Data Types, Sources & Quality
65 min
Learning objectives
- Distinguish structured from unstructured data and identify common sources of each
- Explain the principle of 'garbage in, garbage out' and how data quality bounds model quality
- Assess a dataset for representativeness, completeness, and bias before training
Data is the raw material
Every machine-learning model learns its behavior from data. The architecture and compute matter, but the data is what actually teaches the model what 'right' looks like. A practitioner who can read code but cannot judge data quality is missing the more decisive skill.
Analogy
Think of data as the ingredients in a kitchen. A brilliant chef (the algorithm) cannot rescue a dish made from spoiled, mislabeled, or missing ingredients. Great cooking starts with good groceries — great models start with good data.
Structured vs. unstructured
Data comes in two broad shapes. Structured data fits neatly into rows and columns with named fields, such as a sales table with date, amount, and region. Unstructured data has no fixed schema: emails, product photos, call recordings, scanned PDFs. By most estimates, the majority of the world's data is unstructured, which is why modern AI, with its strength in images and language, has been so valuable.
| Aspect | Structured | Unstructured |
|---|---|---|
| Format | Rows and columns, fixed fields | Free text, images, audio, video |
| Examples | Transactions, sensor logs, CRM records | Emails, photos, call recordings, contracts |
| Typical store | Relational database / spreadsheet | File store, data lake, object storage |
| Ease of use | Ready for analysis quickly | Needs processing/feature extraction first |
Structured data — Data organized in a predefined row-and-column schema, such as a database table or spreadsheet.
Unstructured data — Data with no predefined schema — text, images, audio, video — that must be processed before a model can learn from it.
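To make the two definitions concrete, here is a minimal Python sketch. The sales rows and the email text are invented for illustration, and the word-count function is only one very simple form of feature extraction, not a recommended pipeline.

```python
import re
from collections import Counter

# Structured: every record has the same named fields (hypothetical sales rows).
sales = [
    {"date": "2024-01-05", "amount": 120.0, "region": "North"},
    {"date": "2024-01-06", "amount": 80.0, "region": "South"},
    {"date": "2024-01-07", "amount": 200.0, "region": "North"},
]

# Because the fields are named, a question can be answered directly.
north_total = sum(row["amount"] for row in sales if row["region"] == "North")

# Unstructured: free text with no fields (hypothetical email).
email = "Hi team, the refund for my order is late. Please send the refund today."


def bag_of_words(text):
    """Turn free text into word counts, a very simple form of feature extraction."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


features = bag_of_words(email)

print(north_total)         # 320.0
print(features["refund"])  # 2
```

The structured rows can be queried as they are. The email has to be converted into countable features before any analysis or model can use it.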
Garbage in, garbage out
The oldest rule in computing applies doubly to AI: if you train on flawed data, you get a flawed model that is confident in its mistakes. Models do not correct bad inputs; they faithfully learn whatever patterns exist in the data, including the errors, gaps, and biases.
A model's quality is capped by its data's quality. No amount of tuning fixes a dataset that is wrong, incomplete, or unrepresentative.
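The cap can be seen in a toy experiment. The sketch below assumes a deliberately trivial "model" that memorizes the most common label for each keyword. The keywords, labels, and counts are all hypothetical, chosen only to show that the same learning procedure gives a worse model when the training labels are wrong.

```python
from collections import Counter, defaultdict


def train(examples):
    """Learn the most common label seen for each feature value."""
    votes = defaultdict(Counter)
    for feature, label in examples:
        votes[feature][label] += 1
    return {feature: counts.most_common(1)[0][0] for feature, counts in votes.items()}


def accuracy(model, examples):
    """Share of examples whose true label matches the model's prediction."""
    correct = sum(1 for feature, label in examples if model.get(feature) == label)
    return correct / len(examples)


# Hypothetical toy task: messages with "prize" are spam, messages with "invoice" are not.
clean_train = [("prize", "spam")] * 10 + [("invoice", "ham")] * 10

# Same task, but a labeling error marked most "prize" examples as ham.
flawed_train = [("prize", "ham")] * 8 + [("prize", "spam")] * 2 + [("invoice", "ham")] * 10

# Correctly labeled test set.
test_set = [("prize", "spam")] * 5 + [("invoice", "ham")] * 5

clean_acc = accuracy(train(clean_train), test_set)
flawed_acc = accuracy(train(flawed_train), test_set)

print(clean_acc)   # 1.0
print(flawed_acc)  # 0.5
```

The training code is identical in both runs. Only the labels changed, and the model learned the labeling error as faithfully as it learned the correct pattern.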
Example — Representativeness gone wrong
A hospital trains a skin-condition classifier almost entirely on images of light skin. In the clinic it misses conditions on darker skin because those cases were rare or absent in training. The algorithm was fine; the data did not represent the real population the model would serve.
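One way to catch this kind of failure is to report accuracy per group instead of a single overall number. The sketch below uses invented evaluation results (the group names, labels, and counts are assumptions, not real clinical data) to show how a strong overall score can hide a weak score on an under-represented group.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: accuracy} for records of (group, true_label, predicted_label)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, truth, predicted in records:
        totals[group] += 1
        if truth == predicted:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical evaluation set: 90 lighter-skin cases, only 10 darker-skin cases.
results = (
    [("lighter", "condition", "condition")] * 86
    + [("lighter", "condition", "healthy")] * 4
    + [("darker", "condition", "condition")] * 4
    + [("darker", "condition", "healthy")] * 6
)

overall = sum(1 for _, truth, predicted in results if truth == predicted) / len(results)
per_group = accuracy_by_group(results)

print(overall)              # 0.9
print(per_group["darker"])  # 0.4
```

The overall figure looks healthy because the majority group dominates the average. Splitting the score by group exposes the gap.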
Judging data quality
- Accuracy — are the values and labels actually correct?
- Completeness — how many fields or records are missing, and is the missingness random or systematic?
- Consistency — are units, formats, and definitions the same across sources?
- Representativeness — does the data reflect the population and conditions the model will face in production?
- Timeliness — is the data current enough that the patterns still hold?
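Some of these checks can be automated before training. The sketch below covers completeness, consistency, and timeliness on a handful of invented records. The field names, the fixed "today" date, and the 365-day threshold are assumptions made for illustration. Accuracy and representativeness usually need trusted reference labels or population figures, so they are not covered here.

```python
from datetime import date

# Hypothetical records merged from two sources.
records = [
    {"weight": 70.0, "unit": "kg", "label": "ok", "recorded": date(2024, 5, 1)},
    {"weight": 154.0, "unit": "lb", "label": "ok", "recorded": date(2024, 5, 3)},
    {"weight": None, "unit": "kg", "label": "faulty", "recorded": date(2023, 1, 10)},
    {"weight": 82.5, "unit": "kg", "label": None, "recorded": date(2024, 4, 20)},
]


def missing_rate(rows, field):
    """Completeness: share of rows where the field is absent or None."""
    return sum(1 for r in rows if r.get(field) is None) / len(rows)


def distinct_values(rows, field):
    """Consistency: more than one unit or format in a field is a warning sign."""
    return {r[field] for r in rows if r.get(field) is not None}


def stale_share(rows, field, today, max_age_days):
    """Timeliness: share of rows older than max_age_days."""
    return sum(1 for r in rows if (today - r[field]).days > max_age_days) / len(rows)


today = date(2024, 6, 1)  # fixed so the example is reproducible
report = {
    "weight_missing": missing_rate(records, "weight"),
    "label_missing": missing_rate(records, "label"),
    "units_seen": distinct_values(records, "unit"),
    "stale_share": stale_share(records, "recorded", today, max_age_days=365),
}

print(report["weight_missing"])  # 0.25
print(report["stale_share"])     # 0.25
```

A report like this does not fix anything by itself. It tells you where to look: here, mixed units in one field and one record that is more than a year old.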
Watch out
Beware sampling bias: data collected from a convenient or narrow source (one region, one time window, one channel) can look large and clean yet still misrepresent reality. Size is not the same as representativeness.
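A simple check for sampling bias is to compare each group's share of the sample with its share of the population the model will serve. The sketch below assumes reference population shares are available. The regions, counts, and the 10-percentage-point threshold are hypothetical.

```python
from collections import Counter


def share_gaps(sample_groups, population_shares):
    """Return {group: sample_share - population_share} for each reference group."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {
        group: counts.get(group, 0) / n - share
        for group, share in population_shares.items()
    }


# Hypothetical dataset: 10,000 records, almost all collected in one region.
sample = ["north"] * 9000 + ["south"] * 800 + ["east"] * 200

# Assumed reference shares for the population the model will serve.
population = {"north": 0.40, "south": 0.35, "east": 0.25}

gaps = share_gaps(sample, population)

# Flag groups under-represented by more than 10 percentage points.
under_represented = [group for group, gap in gaps.items() if gap < -0.10]

print(under_represented)  # ['south', 'east']
```

Ten thousand records is a large sample, yet two of the three regions are badly under-represented. The check depends entirely on how trustworthy the reference shares are.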
Knowledge check
Quick practice — not part of your exam score.
Which of the following is the best example of unstructured data?
A team builds a high-accuracy hiring model but later finds it performs poorly on applicants from regions barely present in the training data. This is primarily a failure of:
What does the principle 'garbage in, garbage out' mean for machine learning?