Gen AI · Course · 04 of 4
Putting Order Back In
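Self-attention on its own has no sense of order: shuffle the input tokens and each token gets exactly the same output, just in its shuffled position (strictly speaking, attention is permutation-equivariant). That's a problem: "dog bites man" and "man bites dog" should not look identical. Positional encoding is how we put order back in.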
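You can check this order-blindness numerically. The sketch below is a minimal illustration, not a production layer: it assumes a single attention head with no mask and no positional signal, and the weight matrices `w_q`, `w_k`, `w_v` are random stand-ins rather than trained parameters.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention: no mask, no positions.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
d = 8
x = torch.randn(3, d)  # three token embeddings, standing in for "dog bites man"
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))  # hypothetical weights

perm = torch.tensor([2, 1, 0])  # reorder to "man bites dog"
out = self_attention(x, w_q, w_k, w_v)
out_shuffled = self_attention(x[perm], w_q, w_k, w_v)

# Each token's output vector is unchanged; only the row order follows the shuffle.
print(torch.allclose(out[perm], out_shuffled, atol=1e-4))
```

Because every token ends up with the same vector in either word order, nothing downstream can tell the two sentences apart unless position is added to the input.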
Injecting position
The original Transformer adds a fixed sinusoidal signal to each token embedding. Modern models often use rotary position embeddings (RoPE), which rotate the query and key vectors by an angle that depends on position — encoding relative distance directly into the attention dot product.
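import torch

def sinusoidal(seq_len, d_model):
    # Positions as a column and feature indices as a row, so they
    # broadcast to a (seq_len, d_model) grid of angles.
    pos = torch.arange(seq_len).unsqueeze(1)
    i = torch.arange(d_model).unsqueeze(0)
    # Features 2k and 2k+1 share one frequency: 1 / 10000^(2k / d_model).
    angle = pos / (10000 ** (2 * (i // 2) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle[:, 0::2])  # even features: sine
    pe[:, 1::2] = torch.cos(angle[:, 1::2])  # odd features: cosine
    return pe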
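The rotary idea can be sketched in the same spirit. The function below is an illustrative assumption, not any particular model's implementation: the name `rope`, the choice to pair adjacent features, and the base of 10000 are picked here for clarity, and real libraries often lay the pairs out differently. It rotates each feature pair by an angle proportional to position, then checks that the query-key score depends only on how far apart the two tokens are.

```python
import torch

def rope(x, base=10000.0):
    # x: (seq_len, d) with even d. Rotate each adjacent feature pair
    # by an angle proportional to the token's position.
    seq_len, d = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)     # (seq_len, 1)
    freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # (d/2,)
    angle = pos * freq                                                # (seq_len, d/2)
    cos, sin = torch.cos(angle), torch.sin(angle)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

torch.manual_seed(0)
q_vec, k_vec = torch.randn(16), torch.randn(16)
# Place the same query vector and the same key vector at all 10 positions.
q = rope(q_vec.repeat(10, 1))
k = rope(k_vec.repeat(10, 1))
scores = q @ k.T

# Positions (2, 5) and (6, 9) are both 3 apart, so their scores should match.
print(torch.allclose(scores[2, 5], scores[6, 9], atol=1e-4))
```

Unlike the sinusoidal table, nothing is added to the embeddings here. Position enters only through the rotation of queries and keys, which is why the dot product ends up carrying relative distance.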
// question
Why does a Transformer need positional information at all?
That's the whole skeleton: tokens in, attention to mix them, positions to order them. Everything else — multiple heads, feed-forward layers, normalization — stacks on top of these three ideas.