Gen AI: Zero to One

What Is a Token?

Type: Concept · Time: 5 min read

Before a language model can reason about text, it has to turn that text into numbers. The unit it works with isn't a word and isn't a character — it's a token. Understanding tokens is the first step to understanding why models behave the way they do.

What a token actually is

A token is a chunk of text from a fixed vocabulary. Common words are usually a single token, while rare words get split into several. The model never sees letters — it sees a sequence of integer token IDs.

python
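# Roughly how text becomes token IDs (pieces and IDs are illustrative)
text = "Tokenization is unfamiliar."
# The tokenizer splits the text into vocabulary pieces; a leading space is part of the piece.
tokens = ["Token", "ization", " is", " un", "fam", "iliar", "."]
# Each piece maps to one integer ID; this list is what the model actually receives.
ids = [30642, 1634, 318, 555, 13635, 4797, 13]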

Notice that "unfamiliar" shattered into three pieces and "Tokenization" into two, while "is" stayed whole. That asymmetry is exactly why token counts rarely match word counts: here, three words and a period became seven tokens.
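
To make the "chunks from a fixed vocabulary" idea concrete, here is a minimal sketch of a toy tokenizer. It assumes a seven-entry vocabulary copied from the example above and uses greedy longest-match splitting, which is a simplified stand-in and not how production tokenizers choose their pieces. The names `TOY_VOCAB` and `toy_tokenize` are hypothetical.

```python
# Toy vocabulary: pieces and IDs are taken from the example above and are
# illustrative only. A real vocabulary has tens of thousands of entries.
TOY_VOCAB = {
    "Token": 30642,
    "ization": 1634,
    " is": 318,
    " un": 555,
    "fam": 13635,
    "iliar": 4797,
    ".": 13,
}


def toy_tokenize(text, vocab):
    """Split text into vocabulary pieces by greedy longest match, left to right."""
    longest = max(len(piece) for piece in vocab)
    pieces = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until a piece matches.
        for size in range(min(longest, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in vocab:
                pieces.append(candidate)
                pos += size
                break
        else:
            # No piece of any length matched at this position.
            raise ValueError(f"no vocabulary piece matches at position {pos}")
    return pieces


pieces = toy_tokenize("Tokenization is unfamiliar.", TOY_VOCAB)
ids = [TOY_VOCAB[piece] for piece in pieces]
print(pieces)  # ['Token', 'ization', ' is', ' un', 'fam', 'iliar', '.']
print(ids)     # [30642, 1634, 318, 555, 13635, 4797, 13]
```

The list of integers at the end is the only thing the model would receive; the letters are gone by that point.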

Why it matters

  • Context windows are measured in tokens, not words — so is your bill.
  • Rare names, code, and non-English text usually take more tokens than their length suggests.
  • The same string can tokenize differently across models with different vocabularies.
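
The first point in the list above comes down to simple arithmetic on token counts. The sketch below assumes a made-up context window and a made-up price per 1,000 tokens; both constants are hypothetical placeholders, not the limits or pricing of any real model, and it assumes the prompt and the reply share one window.

```python
# All numbers here are hypothetical placeholders, not real limits or prices.
CONTEXT_WINDOW = 8_000        # assumed maximum tokens per request
PRICE_PER_1K_TOKENS = 0.002   # assumed cost in dollars per 1,000 tokens


def fits_in_context(prompt_tokens, reply_tokens, window=CONTEXT_WINDOW):
    """Return True if the prompt and the reply together stay within the window."""
    return prompt_tokens + reply_tokens <= window


def estimate_cost(total_tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    """Cost scales with the number of tokens, not the number of words."""
    return total_tokens / 1000 * price_per_1k


prompt_tokens = 6_500
reply_tokens = 1_000

print(fits_in_context(prompt_tokens, reply_tokens))  # True: 7,500 <= 8,000
print(fits_in_context(prompt_tokens, 2_000))         # False: 8,500 > 8,000
print(f"${estimate_cost(prompt_tokens + reply_tokens):.4f}")  # $0.0150
```

Both checks take token counts as input, so a text that tokenizes poorly runs into the limit and the bill sooner than its word count would suggest.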

Roughly how many tokens is a 1,000-word English blog post?
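
One way to get a ballpark answer is a commonly quoted rule of thumb for English prose: one token is about three quarters of a word. The ratio is an approximation that shifts with the tokenizer and the kind of text, so treat the sketch below as an estimate, not a measurement. The function name `estimate_tokens` is hypothetical.

```python
# Rule of thumb for typical English text (an approximation, not an exact figure):
# one token is roughly 0.75 words.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count, words_per_token=WORDS_PER_TOKEN):
    """Estimate the token count of English text from its word count."""
    return round(word_count / words_per_token)


print(estimate_tokens(1000))  # 1333
```

So a 1,000-word post lands somewhere around 1,300 tokens under this assumption. Posts full of code, rare names, or non-English text will tend to run higher.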

Next up: how the model decides which chunks become tokens in the first place — the tokenizer itself.
