Module 3 · Generative AI & LLMs — Foundations

Inside a Large Language Model: Tokens, Transformers & Next-Token Prediction

70 min

Learning objectives

Explain what a token is and why models work in tokens rather than words or letters
Describe, intuitively, how a transformer uses attention to relate words
Explain next-token prediction and how it produces coherent text
Connect these mechanics to why LLMs behave the way they do

Step 1: text becomes tokens

An LLM does not read letters or whole words. It reads tokens — chunks of text that are usually a few characters long. A common word like “the” is one token; a longer or rarer word like “unbelievable” may be split into several (“un”, “believ”, “able”). Every piece of input is converted into a sequence of these tokens before the model sees anything.

Token — A chunk of text — often a word fragment of a few characters — that is the basic unit a language model reads and writes.
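To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary is a tiny hypothetical one invented for this example; real models learn vocabularies of many thousands of tokens from data and may split the same words differently.

```python
# Toy vocabulary, invented for illustration only (not any real model's vocabulary).
TOY_VOCAB = {"the", "un", "believ", "able", " "}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into tokens by repeatedly taking the longest vocabulary match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink it.
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                tokens.append(piece)
                i = end
                break
        else:
            # Nothing matched: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

print(toy_tokenize("the"))           # ['the']
print(toy_tokenize("unbelievable"))  # ['un', 'believ', 'able']
```

The common word comes out as one token, while the longer word is assembled from three smaller pieces, mirroring the example in the text.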

Analogy

Tokens are like LEGO bricks for text. The model doesn't think in finished sculptures (full sentences) or in the plastic itself (individual letters); it works one brick at a time, snapping the next most-likely brick onto what it has built so far.

Example — Why tokens matter in practice

A rough rule of thumb in English is that one token is about four characters, or roughly three-quarters of a word. Because models work in tokens, pricing, speed, and the 'context window' are all measured in tokens. It is also a major reason why asking a model to count the letters in a word can trip it up: the model processes whole tokens rather than seeing each individual letter.
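The rule of thumb can be turned into a quick back-of-the-envelope estimate. The sketch below is an approximation only, not a real tokenizer; the 8,000-token budget, the split of "strawberry", and the ID numbers are all hypothetical values chosen for illustration.

```python
import math

CHARS_PER_TOKEN = 4      # rule of thumb for English text; an approximation
WORDS_PER_TOKEN = 0.75   # likewise approximate

def estimate_tokens(text):
    """Rough token count from character length (not a real tokenizer)."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_words(token_count):
    """Rough number of English words that fit in a given token budget."""
    return int(token_count * WORDS_PER_TOKEN)

print(estimate_tokens("strawberry"))  # 3, since 10 characters / 4 rounds up to 3
print(estimate_words(8000))           # 6000 words for a hypothetical 8,000-token budget

# Why letter counting is hard: the model receives token IDs, not letters.
# Both the split and the ID numbers below are invented for illustration.
hypothetical_split = ["str", "aw", "berry"]
hypothetical_ids = {"str": 101, "aw": 102, "berry": 103}
print([hypothetical_ids[piece] for piece in hypothetical_split])  # [101, 102, 103]
```

Nothing in the final list of IDs exposes the individual letters, which is why a letter-counting question is harder for a model than it looks.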

Step 2: the transformer pays attention

Once text has been converted to tokens, a transformer processes them. Its key trick is attention: for each token, the model weighs how much each of the other tokens in view matters to it. In “the trophy didn't fit in the suitcase because it was too big,” attention helps the model link “it” to “trophy” rather than “suitcase.” Attention is computed for all tokens in the input in parallel, which is what lets transformers capture long-range relationships that earlier designs, which passed information along one step at a time, tended to lose.

Attention — The mechanism inside a transformer that lets each token weigh how relevant every other token is, capturing relationships across a sentence or document.
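A minimal sketch of the idea, using scaled dot-product attention for a single token. The two-dimensional vectors are invented for illustration; in a real transformer the query, key, and value vectors are learned, have hundreds or thousands of dimensions, and are computed across many attention heads and layers.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

def attend(query, keys, values):
    """Blend the value vectors according to the attention weights."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Invented vectors: the query for "it" points mostly in the same direction
# as the key for "trophy", so "trophy" should receive more weight.
names = ["trophy", "suitcase"]
keys = [[2.0, 0.0], [0.0, 2.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
query_it = [2.0, 0.5]

weights = attention_weights(query_it, keys)
for name, weight in zip(names, weights):
    print(f"{name}: {weight:.2f}")
print(attend(query_it, keys, values))
```

With these made-up numbers, "trophy" gets the larger weight, so the blended result for "it" is dominated by the "trophy" value vector.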

Attention is why modern models handle context so well: instead of carrying a short running memory forward one word at a time, they consider all the tokens in view together and decide what relates to what.

Step 3: predict the next token, over and over

An LLM has one core skill: given the tokens so far, it assigns a probability to every possible next token, then selects one, appends it, and repeats. “The capital of France is” makes “Paris” very high-probability. Coherent essays, code, and answers all emerge from this single loop run thousands of times.

How does it choose which token? Most chat models don't simply grab the single highest-probability token (that's called greedy decoding); they sample from the probability distribution, which adds variety. A setting called temperature controls how adventurous that sampling is: low temperature stays close to the most likely tokens (focused, repetitive), high temperature spreads the choice out (varied, creative).
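The difference between greedy decoding and temperature sampling can be sketched in a few lines. The candidate tokens and their raw scores below are hypothetical numbers chosen for illustration, and the function names are this example's own; dividing the scores by the temperature before a softmax is a common way to implement the setting, though individual systems may differ in detail.

```python
import math
import random

def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy() instead")
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(tokens, scores):
    """Always pick the single highest-scoring token."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return tokens[best]

def sample(tokens, scores, temperature=1.0, rng=random):
    """Pick a token at random, weighted by its probability."""
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical scores for the token after "The capital of France is".
candidates = [" Paris", " Lyon", " a"]
scores = [5.0, 2.0, 1.0]

print(greedy(candidates, scores))  # " Paris" every time
for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])
print(sample(candidates, scores, temperature=2.0))  # varies from run to run
```

As the temperature rises, probability shifts away from " Paris" toward the less likely candidates, which is where the extra variety comes from.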

Convert the prompt into tokens.
Run the tokens through the transformer to weigh their relationships.
Produce a probability for every possible next token.
Select one token (the likeliest, or a slightly randomized choice) and append it.
Feed the new, longer sequence back in and repeat until the response is complete.
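The steps above can be sketched as a short loop. To keep the example self-contained, a hand-written lookup table stands in for the transformer: it is an invented toy that looks only at the last token, whereas a real model conditions on the whole sequence. The "<end>" marker and all probabilities are likewise assumptions made for this illustration.

```python
import random

# Stand-in for the transformer: maps the last token to next-token probabilities.
# Invented for illustration; a real model computes these from the full sequence.
TOY_MODEL = {
    "The": {" capital": 1.0},
    " capital": {" of": 1.0},
    " of": {" France": 1.0},
    " France": {" is": 1.0},
    " is": {" Paris": 0.9, " beautiful": 0.1},
    " Paris": {".": 1.0},
    " beautiful": {".": 1.0},
    ".": {"<end>": 1.0},
}

def next_token_probs(tokens):
    """Steps 2-3: return a probability for each possible next token."""
    return TOY_MODEL[tokens[-1]]

def generate(prompt_tokens, max_new_tokens=10, use_greedy=True, rng=None):
    rng = rng or random.Random()
    tokens = list(prompt_tokens)  # step 1 (tokenizing the prompt) is assumed done
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        if use_greedy:
            choice = max(probs, key=probs.get)  # step 4: the likeliest token
        else:
            # step 4, alternative: a randomized choice weighted by probability
            choice = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        if choice == "<end>":
            break  # the response is complete
        tokens.append(choice)  # step 5: feed the longer sequence back in
    return tokens

prompt = ["The", " capital", " of", " France", " is"]
print("".join(generate(prompt)))  # The capital of France is Paris.
```

Passing use_greedy=False makes the loop sample instead, so it will occasionally continue with " beautiful" rather than " Paris".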

Analogy

It's an extraordinarily well-read autocomplete. Your phone suggests the next word from a tiny history; an LLM suggests the next token from patterns across a vast slice of human writing — which is why it sounds fluent and informed.

Watch out

Crucially, the model optimizes for what sounds likely, not for what is true. There is no separate fact-checking step. Fluency is the goal of the mechanism; accuracy is not guaranteed by it. This single fact explains most LLM surprises — including hallucination, covered next.

An LLM is next-token prediction at scale: tokens in, attention to relate them, a probability distribution over possible next tokens, one token chosen, repeat. No hidden reasoning engine, no database of verified facts.

Knowledge check

Quick practice — not part of your exam score.

What is the single core operation an LLM performs to generate text?

In a transformer, what does the 'attention' mechanism do?

Why can a capable LLM still struggle to count the letters in a word like 'strawberry'?
