Module 3 · Generative AI & LLMs — Foundations
Beyond Text: Images, Audio & Multimodal Models
55 min
Learning objectives
- Describe how generative models extend beyond text to images and audio
- Explain what 'multimodal' means and give practical examples
- Recognize that core limits like hallucination carry over to other modalities
The same idea, other media
The generative principle isn't limited to text. The same broad approach — learn the patterns in a huge body of examples, then produce new examples — applies to images, audio, video, and code. The data type changes; the core idea of modeling patterns and generating plausible new content does not.
- Image generation: produce a picture from a text description (“a watercolor of a lighthouse at dusk”).
- Speech: convert text to natural-sounding voice (text-to-speech) or speech to text (transcription).
- Music and sound: generate melodies, sound effects, or full tracks from a description.
- Video: generate or edit short clips from text or image prompts.
Many image generators use a technique called diffusion: they start from random noise and repeatedly refine it toward an image that matches the prompt. You don't need the math — the intuition is 'sculpt a clear picture out of static, guided by the words.'
Analogy
Diffusion is like a sculptor revealing a statue from a rough block: each pass removes noise and brings the intended shape into focus, guided by your description.
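If you like seeing ideas in code, the 'start from static, refine step by step' intuition can be sketched as a toy loop. This is not how a real image generator is built. The function `toy_denoise`, the five-number 'image', and the `strength` setting are all made up for illustration, and the loop cheats by already knowing the target picture. A real diffusion model has no answer key: a trained neural network estimates what noise to remove at each step, guided by the prompt.

```python
import random


def toy_denoise(target, steps=20, strength=0.3, seed=0):
    """Refine random noise toward `target` in small steps.

    `target` is a hypothetical stand-in for "what the prompt describes".
    A real diffusion model does not know the target in advance; a trained
    network predicts which noise to remove on each pass.
    """
    rng = random.Random(seed)
    # Start from pure static: one random value per "pixel".
    image = [rng.uniform(0.0, 1.0) for _ in target]
    errors = []
    for _ in range(steps):
        # Each pass nudges every pixel part of the way toward the target.
        image = [px + strength * (t - px) for px, t in zip(image, target)]
        # Record how far the worst pixel still is from the target.
        errors.append(max(abs(t - px) for px, t in zip(image, target)))
    return image, errors


target = [0.0, 0.5, 1.0, 0.5, 0.0]  # hypothetical 5-pixel "image"
image, errors = toy_denoise(target)
print("refined image:", [round(px, 2) for px in image])
print("error after first pass:", errors[0])
print("error after last pass:", errors[-1])
```

The point to notice is the shape of the process: no single pass produces the picture, but each one leaves less static than the one before.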
What “multimodal” means
A multimodal model handles more than one type of data at once. You can show it a photo and ask a question about it, hand it a chart and ask for the trend, or speak to it and get a spoken reply. Many flagship assistants in 2026 are multimodal: text, images, and audio flow through one model rather than separate tools bolted together.
Multimodal model — A model that accepts and/or produces more than one type of data — such as text plus images or audio — within a single system.
Example — Multimodal in action
Photograph a fridge's contents and ask 'what can I cook tonight?' — the model reads the image and replies in text. Or read a passage aloud and ask for a spoken summary. The same model spans the input and output types.
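One way to picture the fridge example is as a single request made of parts, where each part carries its own data type. The sketch below is purely illustrative: `Part`, `modalities`, `is_multimodal`, and the file name `fridge.jpg` are hypothetical names chosen for this lesson, not any real assistant's API.

```python
from dataclasses import dataclass


@dataclass
class Part:
    """One piece of a request (hypothetical structure for illustration)."""
    kind: str     # "text", "image", or "audio"
    content: str  # the text itself, or a file name for media


def modalities(parts):
    """Return the distinct data types in a request, in first-seen order."""
    seen = []
    for part in parts:
        if part.kind not in seen:
            seen.append(part.kind)
    return seen


def is_multimodal(parts):
    """A request is multimodal when it mixes more than one data type."""
    return len(modalities(parts)) > 1


# The fridge example: one image plus one text question in the same request.
fridge_request = [
    Part("image", "fridge.jpg"),  # hypothetical file name
    Part("text", "What can I cook tonight?"),
]
text_only = [Part("text", "Summarize this paragraph.")]

print(modalities(fridge_request), is_multimodal(fridge_request))
print(modalities(text_only), is_multimodal(text_only))
```

The sketch only labels what goes in. What makes a model multimodal is that one system can interpret all of those parts together.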
| Conversion | Example input | Example output |
|---|---|---|
| Text → Image | “A red barn at sunset” | A generated picture |
| Image → Text | A photo of a receipt | Extracted line items |
| Text → Audio | A paragraph of script | Spoken narration |
| Audio → Text | A recorded meeting | A written transcript |
Multimodal doesn't mean 'smarter' — it means 'works across data types.' One model can now see, read, listen, and speak, which collapses workflows that once needed several separate systems.
Watch out
The limits from Lesson 3, such as hallucination, carry over to these modalities. Image generators can produce convincing but wrong details (extra fingers, garbled or fake text, invented logos), and image, audio, and video generators can all be misused to create deceptive deepfakes. 'Looks real' is never proof that something is real or accurate.
Knowledge check
Quick practice — not part of your exam score.
What does it mean for a model to be 'multimodal'?
Which statement about generative limits across modalities is most accurate?