Gen AI · Course · 02 of 5
Linear Algebra, Intuitively
Linear algebra sounds scary. It isn't. Three ideas — vectors, the dot product, and matrix multiplication — are enough to understand what every layer of a neural network actually does. This is the intuition, not the proofs.
Learning objectives
- Picture a vector as a point / direction in space — and see why an embedding is just a vector
- Read the dot product as a similarity score
- See matrix multiplication as a batch of dot products (and as a transformation)
- Tell when vectors are linearly independent, and why rank — how many directions are real — powers tricks like LoRA
- Connect it all to what a Transformer computes when it "pays attention"
The problem
Open any model and it's linear algebra all the way down: tokens become vectors (embeddings), a layer multiplies them by a weight matrix, and attention scores every vector against every other with dot products. If those three operations feel natural, deep learning stops being a black box. You don't need to prove theorems — you need a feel for the shapes and what they mean.
Pre-lesson check
- A word embedding is best described as…
- The dot product of two vectors roughly tells you…
The concept
Everything scales up from the vector. A scalar is one number. A vector is a list of numbers — a point in space, or an arrow from the origin. A matrix is a grid: either a stack of vectors, or a machine that transforms vectors. Multiplication is how vectors get compared and combined.
From a single number to a neural-network layer
Build it
Step 1 — Vectors are meaning
A vector is just a list of numbers, but you can treat each number as a coordinate. Two words with similar meaning end up as vectors pointing in similar directions — that's the whole idea behind embeddings.
And in AI, everything gets this treatment — the vector is the universal container for meaning:
| Thing | Becomes | So the model can… |
|---|---|---|
| A word / token | a vector of 768–4096 numbers | place it in "meaning space" next to related words |
| A whole document | one embedding vector | be found by semantic search / RAG |
| An image | a vector of pixel or feature values | be compared, classified, captioned |
| A user | a vector of preferences | get recommendations from nearby users |
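As a rough sketch of the first row of that table, a token "becomes" a vector by looking up its row in an embedding table. The vocabulary, the token IDs, and the 4-D size below are made up for illustration, and the table holds random stand-in values rather than trained embeddings.

```python
import numpy as np

# Hypothetical 5-word vocabulary; the IDs are arbitrary.
vocab = {"king": 0, "queen": 1, "apple": 2, "banana": 3, "the": 4}

rng = np.random.default_rng(0)
# One row per token. Random stand-in values, not a trained table.
embedding_table = rng.normal(size=(len(vocab), 4))

token_ids = [vocab[w] for w in ["the", "king"]]
vectors = embedding_table[token_ids]   # row lookup: one vector per token

print(token_ids)        # [4, 0]
print(vectors.shape)    # (2, 4)
```

Training is what moves those rows around so that related tokens end up pointing in similar directions.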
import numpy as np
# Toy 4-D "embeddings" — real ones have hundreds/thousands of dims
king = np.array([0.9, 0.8, 0.1, 0.7])
queen = np.array([0.9, 0.2, 0.1, 0.8])
apple = np.array([0.1, 0.1, 0.9, 0.2])
print(king.shape) # (4,) -> a point in 4-D space
print(np.linalg.norm(king)) # its length (magnitude)
Step 2 — The dot product is similarity
Multiply the vectors element-wise and add it all up. One number comes out — and its sign already tells a story:
| a · b | Geometry | Read it as |
|---|---|---|
| large positive | pointing the same way | similar |
| ≈ 0 | perpendicular (orthogonal) | unrelated |
| negative | pointing opposite ways | dissimilar / opposed |
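To make the three rows of the table concrete, here is a small sketch with hand-picked 2-D vectors (chosen for easy arithmetic, not taken from any model):

```python
import numpy as np

a = np.array([1.0, 2.0])

same = np.array([2.0, 4.0])            # same direction as a
perpendicular = np.array([-2.0, 1.0])  # at a right angle to a
opposite = np.array([-1.0, -2.0])      # reverse direction

print(np.dot(a, same))           # 1*2 + 2*4       = 10.0
print(np.dot(a, perpendicular))  # 1*(-2) + 2*1    = 0.0
print(np.dot(a, opposite))       # 1*(-1) + 2*(-2) = -5.0
```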
This one operation is quietly running half the modern stack: vector search embeds your query and dots it against every stored document, recommender systems dot user vectors against item vectors, and RAG retrieval is "return the chunks with the highest scores." Different products — same multiply-and-add.
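A minimal sketch of that retrieval step, assuming the documents are already embedded. The document names and 3-D vectors are invented for illustration; a real system would get them from an embedding model and would usually search an index rather than scan every row.

```python
import numpy as np

# Made-up 3-D "document embeddings", one row per document.
doc_names = ["royalty", "fruit", "monarchy"]
docs = np.array([
    [0.9, 0.1, 0.8],
    [0.1, 0.9, 0.2],
    [0.7, 0.2, 0.8],
])
query = np.array([1.0, 0.0, 1.0])

scores = docs @ query                # one dot product per document
ranking = np.argsort(scores)[::-1]   # document indices, highest score first

for i in ranking:
    print(doc_names[i], round(float(scores[i]), 2))
```

With these numbers the order comes out as royalty, monarchy, fruit.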
def cosine_similarity(a, b):
    # Dot product divided by both lengths, so only direction is compared
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cosine_similarity(king, queen), 3)) # high -> similar
print(round(cosine_similarity(king, apple), 3)) # low -> different
Step 3 — Linear independence: how many directions are real?
Give me a set of vectors. The question that matters isn't how many there are — it's how many point in genuinely new directions. A vector is redundant if you can build it by scaling and adding the others; it takes you nowhere the rest couldn't already reach.
v1 = np.array([1, 0, 0])
v2 = np.array([0, 1, 0])
v3 = np.array([2, 1, 0]) # = 2*v1 + v2 -> adds no new direction
V = np.stack([v1, v2, v3])
print(np.linalg.matrix_rank(V)) # 2, not 3
v1 and v2 are independent: two real directions. But v3 = 2·v1 + v2, so it's old news: three vectors, yet only two real directions. They all lie flat in the x-y plane and can never reach [0, 0, 1]. That count of genuinely independent directions is the rank; here it's 2.
Here's the same idea as a diagnosis table — what a matrix's rank tells you about your model:
| Situation | Rank | What it means in ML |
|---|---|---|
| Full rank | the maximum possible | Every feature adds real information; a linear model then has exactly one best-fit set of weights, and training can find it. |
| Rank-deficient | below maximum | Some features are combinations of others — infinitely many weight settings fit the data equally well. Regularization is how you pick one. |
| Rank 1 | 1 | Every column is a scaled copy of one vector. A whole grid of numbers holding a single direction of information. |
| Almost rank-deficient | full on paper, shaky in practice | Some directions are nearly redundant, so tiny input noise causes big output swings (an "ill-conditioned" matrix). Same fix: regularize. |
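Two rows of that table are easy to check numerically. The sketch below builds a rank-1 matrix from an outer product, then compares a well-behaved matrix with a nearly redundant one using the condition number (`np.linalg.cond`), where a large value means small input changes can be amplified a lot. The matrices are hand-picked for illustration.

```python
import numpy as np

# Rank 1: every column is a scaled copy of the same vector u.
u = np.array([1.0, 2.0, 3.0])
rank_one = np.outer(u, [1.0, -2.0, 0.5])
print(np.linalg.matrix_rank(rank_one))          # 1

# Full rank on paper, but the second column is almost a copy of the first.
well_conditioned = np.array([[1.0, 0.0],
                             [0.0, 1.0]])
nearly_redundant = np.array([[1.0, 1.0],
                             [1.0, 1.0001]])
print(np.linalg.matrix_rank(nearly_redundant))  # 2
print(np.linalg.cond(well_conditioned))         # 1.0
print(np.linalg.cond(nearly_redundant))         # large, on the order of 10^4
```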
Step 4 — Matrix × vector is a transformation
A matrix times a vector produces a new vector — rotated, scaled, or projected into a different space. A linear layer in a network is exactly this: y = W x (plus a bias). Training is the search for the weight matrix W that maps inputs to useful outputs.
Sit with that for a second, because it flips how you see models: the matrices are the model. The weights are matrices that transform vectors, attention is a matrix of scores deciding what to look at, and the embedding table is a matrix mapping token IDs to meaning. When you download "the weights," you're downloading a stack of transformations.
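As a small sketch of "rotated, scaled", here are two hand-written 2-D matrices followed by a linear layer with a bias. The values in `W` and `b` are stand-ins picked for illustration, not trained weights.

```python
import numpy as np

theta = np.pi / 2                       # 90 degrees
rotate = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
scale = np.array([[2.0, 0.0],
                  [0.0, 2.0]])

x = np.array([1.0, 0.0])
print(np.round(rotate @ x, 3))          # [0. 1.] -> turned a quarter-circle
print(scale @ x)                        # [2. 0.] -> same direction, twice as long

# A linear layer is the same operation with learned numbers, plus a bias.
W = np.array([[0.5, -1.0],
              [1.5,  2.0]])             # stand-in values
b = np.array([0.1, -0.1])
y = W @ x + b
print(y)
```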
W = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]]) # (2, 4): projects 4-D down to 2-D
y = W @ king # matrix @ vector
print(y, y.shape) # -> a new 2-D vector
Step 5 — Matrix × matrix is many dot products at once
Matrix multiplication is nothing more than every row of the first, dotted with every column of the second. That's why shapes have to line up: an (m × k) times a (k × n) gives an (m × n).
A = np.random.rand(2, 3) # (2, 3)
B = np.random.rand(3, 4) # (3, 4)
C = A @ B # (2, 4) — inner 3's cancel
print(C.shape)
| Operation | Shapes | Result | Intuition |
|---|---|---|---|
| Dot product | (k) · (k) | scalar | one similarity score |
| Matrix × vector | (m × k) @ (k) | (m) | transform one vector |
| Matrix × matrix | (m × k) @ (k × n) | (m × n) | all row·column scores at once |
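One way to convince yourself of the last row is to pick a single entry of the product and rebuild it as a dot product. The sketch below uses random matrices and also shows what happens when the inner dimensions don't match.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((2, 3))
B = rng.random((3, 4))
C = A @ B                               # (2, 4)

i, j = 1, 2
entry = np.dot(A[i, :], B[:, j])        # row i of A dotted with column j of B
print(np.isclose(C[i, j], entry))       # True

# Mismatched inner dimensions fail: (2, 3) @ (2, 3) has 3 != 2 in the middle.
try:
    A @ A
except ValueError as err:
    print("shape error:", err)
```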
Step 6 — This is attention
Here's the payoff. In self-attention, each token is a vector. To decide how much token i should attend to token j, the model takes the dot product of token i's query vector with token j's key vector, which is a similarity score. Doing that for every pair at once is a single matrix multiply: scores = Q @ K.T.
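A slightly fuller sketch of that computation, assuming the standard scaled dot-product form: queries and keys come from separate projection matrices, the scores are divided by the square root of the key size, and a softmax turns each row into weights. The names `W_q` and `W_k`, the sizes, and the random values are all illustrative stand-ins for what a trained model would hold.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 3, 4, 4            # illustrative sizes

X = rng.normal(size=(n_tokens, d_model))    # one row per token (stand-in embeddings)
W_q = rng.normal(size=(d_model, d_k))       # random stand-ins for learned projections
W_k = rng.normal(size=(d_model, d_k))

Q = X @ W_q                                 # queries, (3, 4)
K = X @ W_k                                 # keys, (3, 4)

# Entry [i, j] is query i dotted with key j, scaled by sqrt(d_k).
scores = Q @ K.T / np.sqrt(d_k)             # (3, 3)

# Softmax over each row; subtracting the row max keeps exp() numerically stable.
shifted = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(weights.shape)          # (3, 3)
print(weights.sum(axis=1))    # each row sums to 1, up to rounding
```

Row i of `weights` says how token i splits its attention across all tokens.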
Q = np.stack([king, queen, apple]) # (3, 4): one query vector per token
K = Q.copy() # (3, 4): the keys (toy shortcut: real models derive Q and K from separate learned projections)
scores = Q @ K.T # (3, 3): every query vs every key
print(scores.shape)
print(np.round(scores, 2)) # row i = how much token i matches each token
Where each idea shows up
Nothing in this lesson is math for math's sake — each idea is doing a specific job in systems you'll build:
| Idea | Where you'll meet it |
|---|---|
| Dot product | Attention scores in Transformers; scoring chunks in vector search / RAG |
| Cosine similarity | Comparing embeddings — semantic search, dedup, clustering |
| Matrix × vector | Every linear layer: y = W x |
| Matrix × matrix | Attention over all pairs at once (Q @ K.T); batching many inputs together |
| Linear independence | Spotting redundant features; why correlated inputs destabilize weights |
| Rank | LoRA fine-tuning, model compression, "how much signal is actually in this matrix?" |
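For the Rank row, here is a sketch of the low-rank idea behind LoRA: instead of training a full update to a weight matrix, train two thin factors whose product has the same shape but rank at most r. The sizes are illustrative, and the factors are filled with random values only so the rank is visible; this is not how any particular library initializes or applies them.

```python
import numpy as np

d, k, r = 512, 512, 8                   # illustrative sizes; r is the chosen low rank

rng = np.random.default_rng(0)
W = rng.normal(size=(d, k))             # stand-in for a frozen pretrained weight matrix
B = rng.normal(size=(d, r))             # small trainable factor
A = rng.normal(size=(r, k))             # small trainable factor

delta = B @ A                           # (512, 512) update, but rank at most r
W_adapted = W + delta

print(delta.shape)                      # (512, 512)
print(np.linalg.matrix_rank(delta))     # 8
print(W.size)                           # 262144 numbers in the full matrix
print(B.size + A.size)                  # 8192 numbers in the two factors
```

The update touches every entry of W, yet it is described by about 3% as many numbers.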
Use it
Run the demo to see similarities and an attention-score matrix print out:
python foundations/linear-algebra/vectors.py
Ship it
This lesson produces:
- — vectors, cosine similarity, a linear layer, and attention scores in ~30 lines of numpy
- — the same, cell by cell, to poke at
Exercises
- Make two toy 3-D "word" vectors you think are similar and two that aren't; compute cosine similarity and check your intuition.
- Write `matmul(A, B)` with plain Python loops (no numpy) and verify it matches `A @ B`.
- Given `Q` and `K` of shape `(5, 8)`, what shape is `Q @ K.T`? Compute it and say what a single entry means.
- Normalize a batch of vectors to unit length in one numpy expression.
Post-lesson quiz
- For A of shape (m × k) and B of shape (k × n), what shape is A @ B?
- Cosine similarity differs from the raw dot product because it…
- In attention, `scores = Q @ K.T` computes…
- v1 = [1,0,0], v2 = [0,1,0], v3 = [2,1,0]. Are the three linearly independent?
- Why can LoRA fine-tune a huge weight matrix with so few parameters?
Key terms
| Term | What people say | What it actually means |
|---|---|---|
| Vector | "a list of numbers" | A point / direction in space; an embedding is one |
| Dot product | "multiply and add" | One number measuring how aligned two vectors are (similarity) |
| Cosine similarity | "the angle between them" | Dot product of unit-normalized vectors — direction-only similarity, ignores magnitude |
| Matrix | "a grid of numbers" | A stack of vectors, or a transformation applied to vectors |
| Matrix multiplication | "rows times columns" | Every row·column dot product at once; (m × k) @ (k × n) = (m × n) |
| Linear independence | "they don't overlap" | No vector in the set can be built from the others — each adds a genuinely new direction |
| Rank | "how many dimensions" | The number of linearly independent directions — how much real information a matrix holds |
| Linear layer | "a dense / fully-connected layer" | W x (+ bias): a learned matrix that transforms the input vector |