Module 3 · Generative AI & LLMs — Foundations

Beyond Text: Images, Audio & Multimodal Models

55 min

Learning objectives

Describe how generative models extend beyond text to images and audio
Explain what 'multimodal' means and give practical examples
Recognize that core limits like hallucination carry over to other modalities

The same idea, other media

The generative principle isn't limited to text. The same broad approach — learn the patterns in a huge body of examples, then produce new examples — applies to images, audio, video, and code. The data type changes; the core idea of modeling patterns and generating plausible new content does not.

Image generation: produce a picture from a text description (“a watercolor of a lighthouse at dusk”).
Speech: convert text to natural-sounding voice (text-to-speech) or speech to text (transcription).
Music and sound: generate melodies, sound effects, or full tracks from a description.
Video: generate or edit short clips from text or image prompts.

Many image generators use a technique called diffusion: they start from random noise and repeatedly refine it toward an image that matches the prompt. You don't need the math — the intuition is 'sculpt a clear picture out of static, guided by the words.'

Analogy

Diffusion is like a sculptor revealing a statue from a rough block: each pass removes noise and brings the intended shape into focus, guided by your description.
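
The 'refine noise step by step' intuition can be sketched in a few lines of Python. This is a toy illustration only, not how a real image generator is built: a real diffusion model uses a trained neural network to estimate what to remove at each step, guided by the prompt. Here the hypothetical `TARGET` list stands in for 'the image the prompt describes', and `denoise_step` is an assumed stand-in for the learned denoiser.

```python
import random


def denoise_step(image, target, strength=0.3):
    """Move each pixel a fraction of the way toward the target.

    Assumption: a real model does not know the target; it predicts
    the refinement with a trained network conditioned on the prompt.
    """
    return [p + strength * (t - p) for p, t in zip(image, target)]


def generate(target, steps=20, seed=0):
    """Start from random noise and refine it repeatedly."""
    rng = random.Random(seed)
    # Pure static: one random value per "pixel".
    image = [rng.uniform(0.0, 1.0) for _ in target]
    errors = []
    for _ in range(steps):
        image = denoise_step(image, target)
        # Track the largest remaining gap between image and target.
        errors.append(max(abs(p - t) for p, t in zip(image, target)))
    return image, errors


# Hypothetical 1-D "picture" standing in for what the prompt describes.
TARGET = [0.0, 0.2, 0.9, 1.0, 0.9, 0.2, 0.0]

final_image, errors = generate(TARGET)
print([round(p, 2) for p in final_image])
print("largest remaining error:", errors[-1])
```

Each pass shrinks the remaining error, which is the sculptor analogy in miniature: the shape is not drawn in one stroke but emerges over many small corrections.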

What “multimodal” means

A multimodal model handles more than one type of data at once. You can show it a photo and ask a question about it, hand it a chart and ask for the trend, or speak to it and get a spoken reply. Many flagship assistants in 2026 are multimodal: text, images, and audio flow through one model rather than separate tools bolted together.

Multimodal model — A model that accepts and/or produces more than one type of data — such as text plus images or audio — within a single system.

Example — Multimodal in action

Photograph a fridge's contents and ask 'what can I cook tonight?' — the model reads the image and replies in text. Or read a passage aloud and ask for a spoken summary. The same model spans the input and output types.
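
One way to picture the fridge example is as a single request made of parts with different data types. The sketch below is illustrative only: the class names (`TextPart`, `ImagePart`, `AudioPart`) and helper functions are hypothetical and do not correspond to any particular vendor's API.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    data: bytes
    mime_type: str = "image/jpeg"


@dataclass
class AudioPart:
    data: bytes
    mime_type: str = "audio/wav"


Part = Union[TextPart, ImagePart, AudioPart]


def modalities(parts: List[Part]) -> List[str]:
    """Return the distinct data types in a request, in first-seen order."""
    names = {TextPart: "text", ImagePart: "image", AudioPart: "audio"}
    seen: List[str] = []
    for part in parts:
        name = names[type(part)]
        if name not in seen:
            seen.append(name)
    return seen


def is_multimodal(parts: List[Part]) -> bool:
    """A request is multimodal if it mixes more than one data type."""
    return len(modalities(parts)) > 1


# The fridge example: one photo plus one question, sent together.
fridge_request = [
    ImagePart(data=b"<photo bytes>"),  # placeholder, not real image data
    TextPart("What can I cook tonight?"),
]

print(modalities(fridge_request))     # ['image', 'text']
print(is_multimodal(fridge_request))  # True
```

The point of the sketch is that both parts travel in one request to one model, rather than the photo going to a separate vision tool first.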

Direction	Example input	Example output
Text → Image	“A red barn at sunset”	A generated picture
Image → Text	A photo of a receipt	Extracted line items
Text → Audio	A paragraph of script	Spoken narration
Audio → Text	A recorded meeting	A written transcript

Multimodal doesn't mean 'smarter' — it means 'works across data types.' One model can now see, read, listen, and speak, which collapses workflows that once needed several separate systems.

Watch out

The limits from Lesson 3 travel with these modalities. Image generators can produce convincing but wrong details (extra fingers, fake text, invented logos), a model reading an image or listening to audio can misreport what it contains, and generated images, audio, and video can be used to create deceptive deepfakes. 'Looks real' is never proof that content is real or accurate.

Knowledge check

Quick practice — not part of your exam score.

What does it mean for a model to be 'multimodal'?

Which statement about generative limits across modalities is most accurate?
