Gen AI: Zero to One

Linear Algebra, Intuitively

Phase 0 · Type: Concept · Language: Python · Time: ~30 minutes

Linear algebra sounds scary. It isn't. Three ideas — vectors, the dot product, and matrix multiplication — are enough to understand what every layer of a neural network actually does. This is the intuition, not the proofs.

Learning objectives

  • Picture a vector as a point / direction in space — and see why an embedding is just a vector
  • Read the dot product as a similarity score
  • See matrix multiplication as a batch of dot products (and as a transformation)
  • Tell when vectors are linearly independent, and why rank — how many directions are real — powers tricks like LoRA
  • Connect it all to what a Transformer computes when it "pays attention"

The problem

Open any model and it's linear algebra all the way down: tokens become vectors (embeddings), a layer multiplies them by a weight matrix, and attention scores every vector against every other with dot products. If those three operations feel natural, deep learning stops being a black box. You don't need to prove theorems — you need a feel for the shapes and what they mean.

Pre-lesson check

A word embedding is best described as…

The dot product of two vectors roughly tells you…

The concept

Everything scales up from the vector. A scalar is one number. A vector is a list of numbers — a point in space, or an arrow from the origin. A matrix is a grid: either a stack of vectors, or a machine that transforms vectors. Multiplication is how vectors get compared and combined.

From a single number to a neural-network layer:

  1. Scalar: one number
  2. Vector: a list of numbers = a point/direction in space
  3. Matrix: a stack of vectors, or a transformation
  4. Matrix multiply: many dot products at once
  5. A neural-network layer
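
A minimal NumPy sketch of that ladder may help make the shapes concrete. The values are made up for illustration; the point is only that each rung adds one axis.

```python
import numpy as np

scalar = np.array(3.0)                 # one number, no axes
vector = np.array([1.0, 2.0, 3.0])     # a list of numbers, one axis
matrix = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])   # a stack of two 3-D vectors, two axes

print(scalar.ndim, scalar.shape)   # 0 ()
print(vector.ndim, vector.shape)   # 1 (3,)
print(matrix.ndim, matrix.shape)   # 2 (2, 3)

# Used as a transformation, the same matrix maps the 3-D vector to 2-D.
print(matrix @ vector)             # [1. 2.]
```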

Build it

Step 1 — Vectors are meaning

A vector is just a list of numbers, but you can treat each number as a coordinate. Two words with similar meaning end up as vectors pointing in similar directions — that's the whole idea behind embeddings.

And in AI, everything gets this treatment — the vector is the universal container for meaning:

| Thing | Becomes | So the model can… |
| --- | --- | --- |
| A word / token | a vector of 768–4096 numbers | place it in "meaning space" next to related words |
| A whole document | one embedding vector | be found by semantic search / RAG |
| An image | a vector of pixel or feature values | be compared, classified, captioned |
| A user | a vector of preferences | get recommendations from nearby users |
python
import numpy as np

# Toy 4-D "embeddings" — real ones have hundreds/thousands of dims
king  = np.array([0.9, 0.8, 0.1, 0.7])
queen = np.array([0.9, 0.2, 0.1, 0.8])
apple = np.array([0.1, 0.1, 0.9, 0.2])

print(king.shape)          # (4,)  -> a point in 4-D space
print(np.linalg.norm(king))  # its length (magnitude)
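
Since a vector is both a length and a direction, it can help to pull the two apart. The sketch below reuses the toy `king` vector; `direction` and `doubled` are names introduced here for illustration only.

```python
import numpy as np

king = np.array([0.9, 0.8, 0.1, 0.7])   # same toy vector as above

length = np.linalg.norm(king)           # magnitude: sqrt of the sum of squares
direction = king / length               # unit vector: same direction, length 1

print(round(float(length), 3))                      # 1.396
print(round(float(np.linalg.norm(direction)), 3))   # 1.0

# Scaling changes the length but not the direction.
doubled = 2 * king
print(np.allclose(doubled / np.linalg.norm(doubled), direction))  # True
```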

Step 2 — The dot product is similarity

Multiply the vectors element-wise and add it all up. One number comes out — and its sign already tells a story:

| a · b | Geometry | Read it as |
| --- | --- | --- |
| large positive | pointing the same way | similar |
| ≈ 0 | perpendicular (orthogonal) | unrelated |
| negative | pointing opposite ways | dissimilar / opposed |
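
The three rows of that table can be reproduced with hand-picked 2-D vectors. This is only a sketch; the vectors are chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

a = np.array([1.0, 2.0])

same_way = np.array([2.0, 4.0])     # a scaled copy of a
sideways = np.array([-2.0, 1.0])    # perpendicular to a
opposite = np.array([-1.0, -2.0])   # a flipped

print(np.dot(a, same_way))   # 10.0  -> large positive
print(np.dot(a, sideways))   # 0.0   -> orthogonal
print(np.dot(a, opposite))   # -5.0  -> negative

# The same number by hand: multiply element-wise, then sum.
print(float(np.sum(a * same_way)))   # 10.0
```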

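This one operation is quietly running half the modern stack: vector search embeds your query and dots it against every stored document, recommender systems dot user vectors against item vectors, and RAG retrieval is "return the chunks with the highest scores." Different products, same multiply-and-add. One caveat: the raw dot product also grows with the vectors' lengths, so comparisons often divide by both lengths first. That normalized score is cosine similarity; it ranges from -1 to 1 and depends only on direction.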

python
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine_similarity(king, queen), 3))  # high  -> similar
print(round(cosine_similarity(king, apple), 3))  # low   -> different
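
To connect this to retrieval, here is a toy ranking loop. It is a sketch under stated assumptions: the `query` vector is made up, and the three "documents" are simply the lesson's toy word vectors standing in for stored embeddings. Real vector search uses the same scoring idea over far more vectors, usually with an index rather than a loop.

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical "document" embeddings, one per row (toy values).
names = ["king", "queen", "apple"]
docs = np.array([[0.9, 0.8, 0.1, 0.7],
                 [0.9, 0.2, 0.1, 0.8],
                 [0.1, 0.1, 0.9, 0.2]])

query = np.array([0.8, 0.3, 0.0, 0.9])   # made-up query embedding

scores = np.array([cosine_similarity(query, d) for d in docs])
ranking = np.argsort(scores)[::-1]        # indices, highest score first

for i in ranking:
    print(names[i], round(float(scores[i]), 3))   # queen first, apple last
```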

Step 3 — Linear independence: how many directions are real?

Give me a set of vectors. The question that matters isn't how many there are — it's how many point in genuinely new directions. A vector is redundant if you can build it by scaling and adding the others; it takes you nowhere the rest couldn't already reach.

python
v1 = np.array([1, 0, 0])
v2 = np.array([0, 1, 0])
v3 = np.array([2, 1, 0])       # = 2*v1 + v2  ->  adds no new direction

V = np.stack([v1, v2, v3])
print(np.linalg.matrix_rank(V))   # 2, not 3

v1 and v2 are independent — two real directions. But v3 = 2·v1 + v2, so it's old news: three vectors, yet only two real directions. They all lie flat in the x-y plane and can never reach [0, 0, 1]. That count of genuinely independent directions is the rank — here it's 2.
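
Both claims in that paragraph can be checked numerically. The sketch below uses `np.linalg.lstsq` to look for the best possible mix of the three vectors and shows that it still cannot reach [0, 0, 1]; `basis` and `target` are names introduced here.

```python
import numpy as np

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
v3 = np.array([2.0, 1.0, 0.0])

# v3 really is a combination of the other two.
print(np.allclose(v3, 2 * v1 + v2))          # True

# Can any mix of v1, v2, v3 reach [0, 0, 1]? Solve for the best coefficients
# in the least-squares sense and see how close they get.
basis = np.stack([v1, v2, v3], axis=1)       # vectors as columns, shape (3, 3)
target = np.array([0.0, 0.0, 1.0])
coeffs, _, rank, _ = np.linalg.lstsq(basis, target, rcond=None)

print(rank)                                   # 2
print(np.allclose(basis @ coeffs, target))    # False: the target is out of reach
```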

Here's the same idea as a diagnosis table — what a matrix's rank tells you about your model:

| Situation | Rank | What it means in ML |
| --- | --- | --- |
| Full rank | the maximum possible | Every feature adds real information; in a linear model there's one best set of weights and training can find it. |
| Rank-deficient | below maximum | Some features are combinations of others, so infinitely many weight settings fit the data equally well. Regularization is how you pick one. |
| Rank 1 | 1 | Every column is a scaled copy of one vector. A whole grid of numbers holding a single direction of information. |
| Almost rank-deficient | full on paper, shaky in practice | Some directions are nearly redundant, so tiny input noise causes big output swings (an "ill-conditioned" matrix). Same fix: regularize. |
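
The last two rows are easy to see on small matrices. This is an illustrative sketch with hand-made numbers: `rank_one` is built as an outer product, and `shaky` is a 2×2 matrix whose columns are almost identical.

```python
import numpy as np

# Rank 1: an outer product makes every column a scaled copy of one vector.
u = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 0.5, 2.0, 4.0])
rank_one = np.outer(u, w)                     # shape (3, 4)
print(np.linalg.matrix_rank(rank_one))        # 1

# Almost rank-deficient: the second column is nearly a copy of the first.
shaky = np.array([[1.0, 1.0],
                  [1.0, 1.0001]])
print(np.linalg.matrix_rank(shaky))           # 2 (full rank on paper)
print(np.linalg.cond(shaky) > 1e4)            # True: ill-conditioned

# A tiny change in the right-hand side moves the solution a lot:
# roughly [2, 0] versus [1, 1].
x1 = np.linalg.solve(shaky, np.array([2.0, 2.0]))
x2 = np.linalg.solve(shaky, np.array([2.0, 2.0001]))
print(np.round(x1, 3), np.round(x2, 3))
```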

Step 4 — Matrix × vector is a transformation

A matrix times a vector produces a new vector — rotated, scaled, or projected into a different space. A linear layer in a network is exactly this: y = W x (plus a bias). Training is the search for the weight matrix W that maps inputs to useful outputs.

Sit with that for a second, because it flips how you see models: the matrices are the model. The weights are matrices that transform vectors, attention is a matrix of scores deciding what to look at, and the embedding table is a matrix mapping token IDs to meaning. When you download "the weights," you're downloading a stack of transformations.
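
The embedding-table claim can be made concrete. In the sketch below, `vocab` and `embedding_table` are hypothetical names holding the lesson's three toy vectors; it shows that looking up a row by token ID gives the same result as multiplying a one-hot vector by the table.

```python
import numpy as np

# Hypothetical 3-token vocabulary with the lesson's toy 4-D vectors.
vocab = {"king": 0, "queen": 1, "apple": 2}
embedding_table = np.array([[0.9, 0.8, 0.1, 0.7],
                            [0.9, 0.2, 0.1, 0.8],
                            [0.1, 0.1, 0.9, 0.2]])   # (vocab_size, dim) = (3, 4)

token_id = vocab["queen"]
by_lookup = embedding_table[token_id]        # pick row 1

# The same thing written as a matrix product with a one-hot vector.
one_hot = np.zeros(len(vocab))
one_hot[token_id] = 1.0
by_matmul = one_hot @ embedding_table        # (3,) @ (3, 4) -> (4,)

print(by_lookup)                              # [0.9 0.2 0.1 0.8]
print(np.allclose(by_lookup, by_matmul))      # True
```

Row indexing is the cheaper way to do the lookup; the one-hot product is shown only to make the "it's a matrix" view explicit.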

python
W = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]])   # (2, 4): projects 4-D down to 2-D

y = W @ king                       # matrix @ vector
print(y, y.shape)                  # -> a new 2-D vector
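
The example above leaves out the bias. A minimal sketch of the full y = W x + b might look like this; the function name `linear_layer` and the bias values are assumptions made for illustration.

```python
import numpy as np

def linear_layer(x, W, b):
    """Apply y = W x + b to a single input vector x."""
    return W @ x + b

king = np.array([0.9, 0.8, 0.1, 0.7])
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])   # (2, 4), same projection as above
b = np.array([0.5, -0.5])              # hypothetical bias, one value per output

y = linear_layer(king, W, b)
print(y)          # [1.4 0.3]
print(y.shape)    # (2,)
```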

Step 5 — Matrix × matrix is many dot products at once

Matrix multiplication is nothing more than every row of the first, dotted with every column of the second. That's why shapes have to line up: an (m × k) times a (k × n) gives an (m × n).

python
A = np.random.rand(2, 3)   # (2, 3)
B = np.random.rand(3, 4)   # (3, 4)
C = A @ B                  # (2, 4)  — inner 3's cancel
print(C.shape)

| Operation | Shapes | Result | Intuition |
| --- | --- | --- | --- |
| Dot product | (k) · (k) | scalar | one similarity score |
| Matrix × vector | (m × k) @ (k) | (m) | transform one vector |
| Matrix × matrix | (m × k) @ (k × n) | (m × n) | all row·column scores at once |
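
To check the "row dotted with column" reading, one entry of the product can be compared against a plain dot product. The sketch also shows what NumPy does when the inner dimensions do not line up. The seed and the chosen indices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)     # seeded so the run is repeatable
A = rng.random((2, 3))
B = rng.random((3, 4))
C = A @ B                          # (2, 4)

# Entry (i, j) is row i of A dotted with column j of B.
i, j = 1, 2
print(np.isclose(C[i, j], np.dot(A[i, :], B[:, j])))   # True

# Mismatched inner dimensions fail: (2, 3) @ (2, 3) has 3 != 2.
try:
    A @ A
except ValueError as err:
    print("shape error:", type(err).__name__)
```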

Step 6 — This is attention

Here's the payoff. In self-attention, each token is a vector. To decide how much token i should attend to token j, the model takes the dot product of token i's query vector and token j's key vector, which gives a similarity score. Doing that for every pair at once is a single matrix multiply: scores = Q @ K.T.

python
Q = np.stack([king, queen, apple])   # (3, 4): one query vector per token
K = Q.copy()                          # (3, 4): the keys (toy setup: reuse the queries)

scores = Q @ K.T                      # (3, 3): every query vs every key
print(scores.shape)
print(np.round(scores, 2))            # row i = how much token i matches each token
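
A single entry of that matrix can be checked against a plain dot product. One caveat for this sketch: because the toy setup reuses the queries as keys, the score matrix comes out symmetric, and that property is an artifact of the toy. Transformers derive Q and K from separate learned projections, so it generally does not hold there.

```python
import numpy as np

king  = np.array([0.9, 0.8, 0.1, 0.7])
queen = np.array([0.9, 0.2, 0.1, 0.8])
apple = np.array([0.1, 0.1, 0.9, 0.2])

Q = np.stack([king, queen, apple])   # (3, 4)
K = Q.copy()                         # toy setup: keys equal queries
scores = Q @ K.T                     # (3, 3)

# scores[i, j] is query i dotted with key j.
print(np.isclose(scores[0, 1], np.dot(king, queen)))   # True
print(round(float(scores[0, 1]), 2))                   # 1.54

# Because K == Q here, the matrix is symmetric and each diagonal entry
# is a vector dotted with itself, i.e. its squared length.
print(np.allclose(scores, scores.T))                                  # True
print(np.allclose(np.diag(scores), np.linalg.norm(Q, axis=1) ** 2))   # True
```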

Where each idea shows up

Nothing in this lesson is math for math's sake — each idea is doing a specific job in systems you'll build:

| Idea | Where you'll meet it |
| --- | --- |
| Dot product | Attention scores in Transformers; scoring chunks in vector search / RAG |
| Cosine similarity | Comparing embeddings: semantic search, dedup, clustering |
| Matrix × vector | Every linear layer: y = W x |
| Matrix × matrix | Attention over all pairs at once (Q @ K.T); batching many inputs together |
| Linear independence | Spotting redundant features; why correlated inputs destabilize weights |
| Rank | LoRA fine-tuning, model compression, "how much signal is actually in this matrix?" |
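
The rank row is worth one concrete look. LoRA, as commonly described, learns an update to a weight matrix as the product of two thin matrices, so the update can never exceed a small chosen rank. The sketch below only illustrates that parameter-counting idea and is not a LoRA implementation; the sizes `d` and `r` and the random factors are assumptions.

```python
import numpy as np

d, r = 64, 4                         # hypothetical layer width and chosen rank
rng = np.random.default_rng(0)

B = rng.standard_normal((d, r))      # (64, 4)
A = rng.standard_normal((r, d))      # (4, 64)
delta_W = B @ A                      # (64, 64) update built from two thin factors

print(delta_W.shape)                       # (64, 64)
print(np.linalg.matrix_rank(delta_W))      # 4: at most r real directions
print(d * d, "vs", d * r + r * d)          # 4096 vs 512
```

The full matrix has d × d entries, while the two factors together hold only 2 × d × r numbers, which is where the savings come from when r is much smaller than d.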

Use it

Run the demo to see similarities and an attention-score matrix print out:

bash
python foundations/linear-algebra/vectors.py

Ship it

This lesson produces:

  • A script: vectors, cosine similarity, a linear layer, and attention scores in ~30 lines of numpy
  • A notebook: the same, cell by cell, to poke at

Exercises

  1. Make two toy 3-D "word" vectors you think are similar and two that aren't; compute cosine similarity and check your intuition.
  2. Write matmul(A, B) with plain Python loops (no numpy) and verify it matches A @ B.
  3. Given Q and K of shape (5, 8), what shape is Q @ K.T? Compute it and say what a single entry means.
  4. Normalize a batch of vectors to unit length in one numpy expression.

Post-lesson quiz

For A of shape (m × k) and B of shape (k × n), what shape is A @ B?

Cosine similarity differs from the raw dot product because it…

In attention, `scores = Q @ K.T` computes…

v1 = [1,0,0], v2 = [0,1,0], v3 = [2,1,0]. Are the three linearly independent?

Why can LoRA fine-tune a huge weight matrix with so few parameters?

Key terms

| Term | What people say | What it actually means |
| --- | --- | --- |
| Vector | "a list of numbers" | A point / direction in space; an embedding is one |
| Dot product | "multiply and add" | One number measuring how aligned two vectors are (similarity) |
| Cosine similarity | "the angle between them" | Dot product of unit-normalized vectors: direction-only similarity, ignores magnitude |
| Matrix | "a grid of numbers" | A stack of vectors, or a transformation applied to vectors |
| Matrix multiplication | "rows times columns" | Every row·column dot product at once; (m × k) @ (k × n) = (m × n) |
| Linear independence | "they don't overlap" | No vector in the set can be built from the others; each adds a genuinely new direction |
| Rank | "how many dimensions" | The number of linearly independent directions: how much real information a matrix holds |
| Linear layer | "a dense / fully-connected layer" | W x (+ bias): a learned matrix that transforms the input vector |