Gen AI · Course · 02 of 4

What Is a Token?

Type: Concept · Time: 5 min read

Before a language model can reason about text, it has to turn that text into numbers. The unit it works with isn't a word and isn't a character — it's a token. Understanding tokens is the first step to understanding why models behave the way they do.

What a token actually is

A token is a chunk of text from a fixed vocabulary. Common words are usually a single token, while rare words get split into several. The model never sees letters — it sees a sequence of integer token IDs.

python

# Roughly how text becomes token IDs
text = "Tokenization is unfamiliar."
tokens = ["Token", "ization", " is", " un", "fam", "iliar", "."]
ids = [30642, 1634, 318, 555, 13635, 4797, 13]

Notice that "unfamiliar" shattered into three pieces (" un", "fam", "iliar") while "is" stayed whole. That asymmetry is exactly why token counts rarely match word counts.
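To make the "fixed vocabulary" idea concrete, here is a minimal sketch of a toy encoder. It assumes a hypothetical seven-entry vocabulary built from the illustrative chunks and IDs above, and it uses greedy longest-match lookup, which is a simplification chosen for readability and not necessarily how any production tokenizer works.

```python
# Toy vocabulary: chunk -> integer ID. Hypothetical and tiny; the entries
# reuse the illustrative chunks and IDs from the example above.
VOCAB = {
    "Token": 30642, "ization": 1634, " is": 318,
    " un": 555, "fam": 13635, "iliar": 4797, ".": 13,
}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match: at each position, take the longest chunk in vocab."""
    ids = []
    pos = 0
    max_len = max(len(chunk) for chunk in vocab)
    while pos < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            chunk = text[pos:pos + size]
            if chunk in vocab:
                ids.append(vocab[chunk])
                pos += size
                break
        else:
            # No chunk matched: this toy vocabulary cannot represent the text.
            raise ValueError(f"no vocabulary entry covers {text[pos:]!r}")
    return ids


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Map each ID back to its chunk and concatenate."""
    id_to_chunk = {token_id: chunk for chunk, token_id in vocab.items()}
    return "".join(id_to_chunk[token_id] for token_id in ids)


ids = encode("Tokenization is unfamiliar.", VOCAB)
print(ids)                 # [30642, 1634, 318, 555, 13635, 4797, 13]
print(decode(ids, VOCAB))  # Tokenization is unfamiliar.
```

The model only ever receives the list of integers. The round trip through decode shows that no information is lost, even though the word boundaries are gone.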

Why it matters

Context windows are measured in tokens, not words — so is your bill.
Rare names, code, and non-English text usually cost more tokens than their length suggests.
The same string can tokenize differently across models with different vocabularies.
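The first point can be sketched as a quick budget check. Every number below is a hypothetical placeholder, not any real model's limit or price, and pricing input and output tokens separately is an assumption about how a given API bills.

```python
# Hypothetical placeholders: not any real model's limits or prices.
CONTEXT_WINDOW = 8_000          # max tokens the model can handle (prompt + reply)
USD_PER_MILLION_INPUT = 1.00    # assumed price per million prompt tokens
USD_PER_MILLION_OUTPUT = 4.00   # assumed price per million reply tokens


def fits_in_context(prompt_tokens: int, max_reply_tokens: int) -> bool:
    """True if the prompt plus the reserved reply budget fits in the window."""
    return prompt_tokens + max_reply_tokens <= CONTEXT_WINDOW


def request_cost_usd(prompt_tokens: int, reply_tokens: int) -> float:
    """Cost of one request under the assumed per-token prices."""
    total = (prompt_tokens * USD_PER_MILLION_INPUT
             + reply_tokens * USD_PER_MILLION_OUTPUT)
    return total / 1_000_000


print(fits_in_context(7_500, 1_000))  # False: 8,500 tokens exceed the 8,000 window
print(fits_in_context(7_500, 500))    # True: exactly 8,000 tokens
print(request_cost_usd(7_500, 500))   # 0.0095
```

Both checks take token counts as input, never word counts, which is why an accurate count from the model's own tokenizer matters.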

// question

Roughly how many tokens is a 1,000-word English blog post?
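One rough way to estimate it, assuming the commonly quoted rule of thumb that a token averages about three quarters of an English word. The ratio is an assumption and shifts with the tokenizer and the text, so treat the result as a ballpark.

```python
# Rule-of-thumb ratio for English prose; an assumption, not a property of any
# specific tokenizer.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Ballpark token count for a given number of English words."""
    return round(word_count / words_per_token)


print(estimate_tokens(1_000))  # 1333
```

For an exact figure, run the text through the tokenizer of the model you plan to use.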

Next up: how the model decides which chunks become tokens in the first place — the tokenizer itself.

