Gen AI: Zero to One

Self-Attention, Intuitively

Type: Concept | Time: 9 min read

Self-attention is the mechanism that lets a model decide, for every token, which other tokens matter. It's the core idea behind the Transformer — and once it clicks, the rest of the architecture is mostly plumbing.

Queries, keys, and values

Each token is projected into three vectors: a query (what am I looking for?), a key (what do I offer?), and a value (what do I pass along?). Attention scores every query against every key with a dot product, divides the scores by √d_k (the square root of the key dimension), normalizes them with softmax over the keys, and uses the resulting weights to take a weighted sum of the values.
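
The projection step can be sketched as three learned linear maps applied to the same token embeddings. This is a minimal illustration, not a definitive implementation: the sizes (`seq_len = 5`, `d_model = 16`, `d_k = 8`), the names `w_q`, `w_k`, `w_v`, and the bias-free layers are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_len, d_model, d_k = 5, 16, 8   # illustrative sizes (assumptions)

# One learned projection per role; bias-free layers are assumed here.
w_q = nn.Linear(d_model, d_k, bias=False)
w_k = nn.Linear(d_model, d_k, bias=False)
w_v = nn.Linear(d_model, d_k, bias=False)

x = torch.randn(seq_len, d_model)  # stand-in token embeddings

# The same input produces three different views of each token.
q, k, v = w_q(x), w_k(x), w_v(x)   # each: (seq_len, d_k)
print(q.shape, k.shape, v.shape)
```

The resulting `q`, `k`, and `v` have the `(seq_len, d_k)` shape that the attention computation expects.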

python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (seq_len, d_k)
    scores = q @ k.transpose(-2, -1)          # (seq, seq)
    scores = scores / (k.size(-1) ** 0.5)     # scale by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)        # normalize over keys
    return weights @ v                         # weighted sum of values
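
The division by sqrt(d_k) in the function above keeps the scores in a range where softmax does not saturate. A rough numerical sketch, assuming query and key entries are independent with zero mean and unit variance (an idealization chosen for the example): the raw dot product then has a standard deviation of about sqrt(d_k), so it spreads out as the dimension grows, while the scaled score stays near 1. The helper `score_std` is hypothetical.

```python
import torch

torch.manual_seed(0)

def score_std(d_k, n=10_000):
    # Random unit-variance queries and keys (an idealized assumption).
    q = torch.randn(n, d_k)
    k = torch.randn(n, d_k)
    raw = (q * k).sum(dim=-1)      # n independent query-key dot products
    scaled = raw / (d_k ** 0.5)    # same scaling as in attention()
    return raw.std().item(), scaled.std().item()

for d_k in (4, 64, 1024):
    raw_std, scaled_std = score_std(d_k)
    print(f"d_k={d_k:5d}  raw std={raw_std:6.2f}  scaled std={scaled_std:4.2f}")
```

Without the scaling, large d_k tends to produce a few very large scores, softmax puts nearly all its weight on one key, and the gradients through the other positions become very small.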

Why softmax over the keys?

Softmax turns raw scores into a probability distribution that sums to 1 across the key dimension. Each token's output is therefore a convex combination of all value vectors — a soft, differentiable lookup.
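
Both properties can be checked numerically. The sketch below uses random tensors with illustrative sizes (assumptions, not values from the lesson) and verifies that each row of weights sums to 1 and that every output coordinate stays within the range spanned by the value vectors, as a convex combination must.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_k = 4, 8                       # illustrative sizes (assumptions)
q = torch.randn(seq_len, d_k)
k = torch.randn(seq_len, d_k)
v = torch.randn(seq_len, d_k)

# Scaled scores, normalized over the key dimension (the last axis).
weights = F.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)

# Each row is a distribution over keys: non-negative and summing to 1.
print(weights.sum(dim=-1))

out = weights @ v

# A convex combination cannot leave the per-coordinate range of the values
# (a small tolerance absorbs floating-point rounding).
lo = v.min(dim=0).values - 1e-6
hi = v.max(dim=0).values + 1e-6
inside = (out >= lo) & (out <= hi)
print(inside.all().item())
```

Changing `dim=-1` to `dim=0` would normalize over queries instead, and the rows would no longer be valid mixing weights for each token's output.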

In scaled dot-product attention, what does softmax normalize over?

Why divide the scores by √d_k before softmax?
