Gen AI · Course · 03 of 4
Self-Attention, Intuitively
Self-attention is the mechanism that lets a model decide, for every token, which other tokens matter. It's the core idea behind the Transformer — and once it clicks, the rest of the architecture is mostly plumbing.
Queries, keys, and values
Each token is projected into three vectors: a query (what am I looking for?), a key (what do I offer?), and a value (what do I pass along?). Attention scores every query against every key, normalizes with softmax, and uses the result to take a weighted sum of the values.
import torch
import torch.nn.functional as F
def attention(q, k, v):
# q, k, v: (seq_len, d_k)
scores = q @ k.transpose(-2, -1) # (seq, seq)
scores = scores / (k.size(-1) ** 0.5) # scale by sqrt(d_k)
weights = F.softmax(scores, dim=-1) # normalize over keys
return weights @ v # weighted sum of valuesWhy softmax over the keys?
Softmax turns raw scores into a probability distribution that sums to 1 across the key dimension. Each token's output is therefore a convex combination of all value vectors — a soft, differentiable lookup.
// question
In scaled dot-product attention, what does softmax normalize over?
// question
Why divide the scores by √d_k before softmax?