Gen AI · Course · 02 of 4
What Is a Token?
Before a language model can reason about text, it has to turn that text into numbers. The unit it works with isn't a word and isn't a character — it's a token. Understanding tokens is the first step to understanding why models behave the way they do.
What a token actually is
A token is a chunk of text from a fixed vocabulary. Common words are usually a single token, while rare words get split into several. The model never sees letters — it sees a sequence of integer token IDs.
# Roughly how text becomes token IDs
text = "Tokenization is unfamiliar."
tokens = ["Token", "ization", " is", " un", "fam", "iliar", "."]
ids = [30642, 1634, 318, 555, 13635, 4797, 13]

Notice that "unfamiliar" shattered into three pieces (" un", "fam", "iliar") while "is" stayed whole. That asymmetry is exactly why token counts rarely match word counts.
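To make the text-to-IDs step concrete, here is a minimal sketch of a lookup against a fixed vocabulary. It is an illustration under stated assumptions, not how any particular model's tokenizer works: the seven-entry `VOCAB` is a toy built from the pieces and IDs in the example above, the `tokenize` helper is hypothetical, and it uses simple greedy longest-match, whereas real tokenizers typically learn their pieces with algorithms such as byte-pair encoding.

```python
# Toy vocabulary: the pieces and IDs from the example above, nothing more.
# A real vocabulary has tens of thousands of entries.
VOCAB = {
    "Token": 30642, "ization": 1634, " is": 318,
    " un": 555, "fam": 13635, "iliar": 4797, ".": 13,
}

def tokenize(text, vocab):
    """Hypothetical helper: greedy longest-match against a fixed vocabulary."""
    pieces = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocab:
                pieces.append(candidate)
                i += length
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces

text = "Tokenization is unfamiliar."
pieces = tokenize(text, VOCAB)
ids = [VOCAB[piece] for piece in pieces]  # this list is what the model sees

print(pieces)
print(ids)
print(len(text.split()), "words ->", len(ids), "tokens")
```

Three words become seven tokens here, and the leading spaces travel with the pieces (" is", " un") rather than being dropped.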
Why it matters
- Context windows are measured in tokens, not words — so is your bill.
- Rare names, code, and non-English text often take more tokens than their length suggests.
- The same string can tokenize differently across models with different vocabularies.
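The points above can be sketched with two toy vocabularies. Everything here is an assumption made for illustration: `VOCAB_A`, `VOCAB_B`, the `count_tokens` helper, and the five-token `CONTEXT_WINDOW` are invented, and the greedy longest-match rule stands in for whatever a real tokenizer does. The only claim is the shape of the effect: the same string yields different token counts under different vocabularies, and the limit is checked against tokens, not words.

```python
def count_tokens(text, vocab):
    """Hypothetical helper: count greedy longest-match pieces.

    Any character not covered by a vocabulary piece counts as one token.
    """
    count = 0
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        step = 1  # fallback: a single unmatched character is its own token
        for length in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + length] in vocab:
                step = length
                break
        i += step
        count += 1
    return count

# Two invented vocabularies: A only knows fragments, B knows whole words.
VOCAB_A = {"Token", "ization", " is", " un", "fam", "iliar"}
VOCAB_B = {"Tokenization", " is", " unfamiliar"}

CONTEXT_WINDOW = 5  # hypothetical limit, in tokens (tiny on purpose)

text = "Tokenization is unfamiliar."
count_a = count_tokens(text, VOCAB_A)
count_b = count_tokens(text, VOCAB_B)

for name, count in [("A", count_a), ("B", count_b)]:
    verdict = "fits" if count <= CONTEXT_WINDOW else "does not fit"
    print(f"vocabulary {name}: {count} tokens, {verdict}")
```

The word count is three in both cases, yet the string fits the toy window under one vocabulary and not the other. The same logic applies to billing, since usage is counted per token.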
// question
Roughly how many tokens is a 1,000-word English blog post?
Next up: how the model decides which chunks become tokens in the first place — the tokenizer itself.