Gen AI: Zero to One

Putting Order Back In

Type: Concept · Time: 7 min read

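Self-attention on its own is blind to order: shuffle the input tokens and each token gets exactly the same output vector, just in its shuffled slot (strictly speaking, attention is permutation-equivariant). That's a problem: "dog bites man" and "man bites dog" should not look identical. Positional encoding is how we put order back in.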
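To see the order-blindness concretely, here is a minimal, dependency-free sketch. It assumes a single attention head with identity projections (Q = K = V = the embeddings) and made-up 2-d embeddings; a real model learns its projections, but the order-blindness is the same.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Single-head self-attention with identity projections (Q = K = V = X)."""
    d = len(X[0])
    out = []
    for q in X:
        # Scaled dot-product score of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Hypothetical 2-d embeddings for three tokens; no positional signal is added.
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

a = self_attention([emb[t] for t in ["dog", "bites", "man"]])
b = self_attention([emb[t] for t in ["man", "bites", "dog"]])

def same(u, v):
    return all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(u, v))

# "dog" sits in slot 0 of the first sentence and slot 2 of the second.
dog_matches = same(a[0], b[2])
bites_matches = same(a[1], b[1])
man_matches = same(a[2], b[0])
```

All three flags come out `True`: every token ends up with the same representation in both sentences, so nothing downstream can tell who bit whom.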

Injecting position

The original Transformer adds a fixed sinusoidal signal to each token embedding. Modern models often use rotary position embeddings (RoPE), which rotate the query and key vectors by an angle that depends on position — encoding relative distance directly into the attention dot product.

python
import torch

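def sinusoidal(seq_len, d_model):
    """Return a (seq_len, d_model) tensor of fixed sinusoidal position encodings."""
    pos = torch.arange(seq_len).unsqueeze(1)  # positions, shape (seq_len, 1)
    i = torch.arange(d_model).unsqueeze(0)    # dimension indices, shape (1, d_model)
    # Dimensions 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model).
    angle = pos / (10000 ** (2 * (i // 2) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle[:, 0::2])  # even dimensions: sine
    pe[:, 1::2] = torch.cos(angle[:, 1::2])  # odd dimensions: cosine
    return pe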
Figure: Heatmap of sinusoidal positional encodings. Each row is a position; each column a dimension.
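The rotary idea mentioned above can be checked by hand in two dimensions. The sketch below is an illustration, not a full RoPE implementation: it assumes a single frequency `theta` (a made-up value) and arbitrary query and key vectors, whereas real models apply many such rotations, one per pair of dimensions, each with its own frequency.

```python
import math

def rotate(vec, pos, theta=0.5):
    """Rotate a 2-d vector by the angle pos * theta (one rotary frequency pair)."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# Arbitrary illustrative query and key vectors (hypothetical values).
q, k = (1.0, 2.0), (0.5, -1.0)

near = dot(rotate(q, 3), rotate(k, 1))      # positions 3 and 1: offset 2
far = dot(rotate(q, 103), rotate(k, 101))   # positions 103 and 101: offset 2
other = dot(rotate(q, 3), rotate(k, 2))     # positions 3 and 2: offset 1

same_offset_same_score = math.isclose(near, far, abs_tol=1e-9)
different_offset_different_score = not math.isclose(near, other, abs_tol=1e-6)
```

Both flags are `True`: the attention score depends on how far apart the two tokens are, not on where the pair sits in the sequence. That is the sense in which rotating queries and keys encodes relative distance in the dot product.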

Why does a Transformer need positional information at all?

That's the whole skeleton: tokens in, attention to mix them, positions to order them. Everything else — multiple heads, feed-forward layers, normalization — stacks on top of these three ideas.
