Gen AI · Course · 02 of 5

Linear Algebra, Intuitively

Phase: 0 · Type: Concept · Language: Python · Time: ~30 minutes

Linear algebra sounds scary. It isn't. Three ideas — vectors, the dot product, and matrix multiplication — are enough to understand what every layer of a neural network actually does. This is the intuition, not the proofs.

Learning objectives

Picture a vector as a point / direction in space — and see why an embedding is just a vector
Read the dot product as a similarity score
See matrix multiplication as a batch of dot products (and as a transformation)
Tell when vectors are linearly independent, and why rank — how many directions are real — powers tricks like LoRA
Connect it all to what a Transformer computes when it "pays attention"

The problem

Open any model and it's linear algebra all the way down: tokens become vectors (embeddings), a layer multiplies them by a weight matrix, and attention scores every vector against every other with dot products. If those three operations feel natural, deep learning stops being a black box. You don't need to prove theorems — you need a feel for the shapes and what they mean.

Pre-lesson check


// question

A word embedding is best described as…

// question

The dot product of two vectors roughly tells you…

The concept

Everything scales up from the vector. A scalar is one number. A vector is a list of numbers — a point in space, or an arrow from the origin. A matrix is a grid: either a stack of vectors, or a machine that transforms vectors. Multiplication is how vectors get compared and combined.

From a single number to a neural-network layer

Scalar — one number

Vector — a list of numbers = a point/direction in space

Matrix — a stack of vectors, or a transformation

Matrix multiply — many dot products at once

A neural-network layer

Build it

Step 1 — Vectors are meaning

A vector is just a list of numbers, but you can treat each number as a coordinate. Two words with similar meaning end up as vectors pointing in similar directions — that's the whole idea behind embeddings.

And in AI, everything gets this treatment — the vector is the universal container for meaning:

Thing	Becomes	So the model can…
A word / token	a vector of 768–4096 numbers	place it in "meaning space" next to related words
A whole document	one embedding vector	be found by semantic search / RAG
An image	a vector of pixel or feature values	be compared, classified, captioned
A user	a vector of preferences	get recommendations from nearby users

python

import numpy as np

# Toy 4-D "embeddings" — real ones have hundreds/thousands of dims
king  = np.array([0.9, 0.8, 0.1, 0.7])
queen = np.array([0.9, 0.2, 0.1, 0.8])
apple = np.array([0.1, 0.1, 0.9, 0.2])

print(king.shape)          # (4,)  -> a point in 4-D space
print(np.linalg.norm(king))  # its length (magnitude)
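In a real model, vectors like these are stored as rows of one big matrix (the embedding table), and a token ID simply selects a row. Here is a minimal sketch of that lookup, assuming a made-up three-word vocabulary; the names `vocab` and `embedding_table` are hypothetical and not part of the lesson's code.

```python
import numpy as np

# Hypothetical toy vocabulary: token -> row index in the table
vocab = {"king": 0, "queen": 1, "apple": 2}

# One row per token, reusing the toy 4-D vectors from this step
embedding_table = np.array([
    [0.9, 0.8, 0.1, 0.7],   # king
    [0.9, 0.2, 0.1, 0.8],   # queen
    [0.1, 0.1, 0.9, 0.2],   # apple
])

token_ids = [vocab[w] for w in ["queen", "king"]]
vectors = embedding_table[token_ids]      # row lookup by integer index
print(vectors.shape)                      # (2, 4): two tokens, four dims each

# The same lookup written as a matrix multiply with one-hot rows
one_hot = np.eye(len(vocab))[token_ids]   # (2, 3)
print(np.allclose(one_hot @ embedding_table, vectors))  # True
```

The one-hot version is only there to show that "look up a row" is itself a matrix operation; indexing is what you would normally write.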

Step 2 — The dot product is similarity

Multiply the two vectors element-wise and add up the results. One number comes out, and its sign and size already tell a story:

a · b	Geometry	Read it as
large positive	pointing the same way	similar
≈ 0	perpendicular (orthogonal)	unrelated
negative	pointing opposite ways	dissimilar / opposed
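To make the table concrete, here is a small sketch that computes a dot product by hand, checks it against `np.dot`, and then reproduces the three rows with 2-D arrows. The vectors `a`, `b`, `east`, `north`, and `west` are illustrative values chosen for this example.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, -5.0, 6.0])

# Multiply element-wise, then add it all up
by_hand = sum(x * y for x, y in zip(a, b))   # 1*4 + 2*(-5) + 3*6
print(by_hand, np.dot(a, b))                 # 12.0 12.0

# The three rows of the table, with 2-D arrows
east = np.array([1.0, 0.0])
north = np.array([0.0, 1.0])
west = np.array([-1.0, 0.0])

print(np.dot(east, east))    # 1.0  -> same direction
print(np.dot(east, north))   # 0.0  -> perpendicular
print(np.dot(east, west))    # -1.0 -> opposite
```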

This one operation is quietly running half the modern stack: vector search embeds your query and dots it against every stored document, recommender systems dot user vectors against item vectors, and RAG retrieval is "return the chunks with the highest scores." Different products — same multiply-and-add.

python

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(round(cosine_similarity(king, queen), 3))  # high  -> similar
print(round(cosine_similarity(king, apple), 3))  # low   -> different
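The vector-search recipe from this step (embed the query, dot it against every stored document, return the highest scores) fits in a few lines. This is a minimal sketch with hand-written toy vectors: `docs`, `doc_vectors`, `query`, and `top_k` are hypothetical names, and in a real system the vectors would come from an embedding model rather than being typed in.

```python
import numpy as np

docs = ["royal family history", "fruit salad recipe", "medieval monarchs"]
doc_vectors = np.array([
    [0.9, 0.5, 0.1, 0.8],
    [0.1, 0.1, 0.9, 0.2],
    [0.8, 0.7, 0.2, 0.6],
])
query = np.array([0.9, 0.8, 0.1, 0.7])   # pretend this is the embedded query

# One matrix @ vector call scores the query against every document
scores = doc_vectors @ query              # shape (3,)

# Indices of the 2 highest scores, best first
top_k = np.argsort(scores)[::-1][:2]

for i in top_k:
    print(f"{scores[i]:.2f}  {docs[i]}")
```

Raw dot products reward long vectors as well as well-aligned ones, which is why retrieval systems often score with cosine similarity instead.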

Step 3 — Linear independence: how many directions are real?

Give me a set of vectors. The question that matters isn't how many there are — it's how many point in genuinely new directions. A vector is redundant if you can build it by scaling and adding the others; it takes you nowhere the rest couldn't already reach.

python

v1 = np.array([1, 0, 0])
v2 = np.array([0, 1, 0])
v3 = np.array([2, 1, 0])       # = 2*v1 + v2  ->  adds no new direction

V = np.stack([v1, v2, v3])
print(np.linalg.matrix_rank(V))   # 2, not 3

v1 and v2 are independent — two real directions. But v3 = 2·v1 + v2, so it's old news: three vectors, yet only two real directions. They all lie flat in the x-y plane and can never reach [0, 0, 1]. That count of genuinely independent directions is the rank — here it's 2.
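You can ask NumPy directly whether a vector is reachable from the others. The sketch below uses least squares (`np.linalg.lstsq`) to find the best combination of v1 and v2 for a target, then checks whether that combination actually hits it. Using least squares for this check is one convenient choice among several; `basis` and `target` are names introduced here.

```python
import numpy as np

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])
v3 = np.array([2.0, 1.0, 0.0])

basis = np.column_stack([v1, v2])          # (3, 2): columns are v1 and v2

# Best coefficients c such that basis @ c is as close to v3 as possible
coeffs, *_ = np.linalg.lstsq(basis, v3, rcond=None)
print(np.round(coeffs, 6))                 # [2. 1.] -> v3 = 2*v1 + 1*v2
print(np.allclose(basis @ coeffs, v3))     # True: v3 is fully reachable

target = np.array([0.0, 0.0, 1.0])
coeffs_t, *_ = np.linalg.lstsq(basis, target, rcond=None)
print(np.allclose(basis @ coeffs_t, target))   # False: z is out of reach
```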

Here's the same idea as a diagnosis table — what a matrix's rank tells you about your model:

Situation	Rank	What it means in ML
Full rank	the maximum possible	Every feature adds real information; in a linear model the best-fitting weights are unique, so training has a single target to find.
Rank-deficient	below maximum	Some features are exact combinations of others, so infinitely many weight settings fit the data equally well. Regularization is how you pick one.
Rank 1	1	Every column is a scaled copy of one vector. A whole grid of numbers holding a single direction of information.
Almost rank-deficient	full on paper, shaky in practice	Some directions are nearly redundant, so tiny input noise causes big output swings (an "ill-conditioned" matrix). Same fix: regularize.
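The rows of that table are easy to reproduce. The sketch below builds a feature matrix whose second column is an exact copy (times 2) of the first, shows two different weight vectors making identical predictions, lets a ridge penalty pick one answer, and then measures how shaky a nearly duplicated column is. The data is random, and the penalty `lam = 1e-3` and the noise scale `1e-6` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)

# Rank-deficient: the second feature is exactly 2 * the first
X_dup = np.column_stack([x1, 2 * x1])
print(np.linalg.matrix_rank(X_dup))            # 1, not 2

# Two different weight vectors give identical predictions
w_a = np.array([1.0, 0.0])
w_b = np.array([0.0, 0.5])
print(np.allclose(X_dup @ w_a, X_dup @ w_b))   # True

# Ridge regularization picks one answer: the small-norm one
y = X_dup @ w_a
lam = 1e-3
w_ridge = np.linalg.solve(X_dup.T @ X_dup + lam * np.eye(2), X_dup.T @ y)
print(np.round(w_ridge, 3))                    # [0.2 0.4]

# Almost rank-deficient: add a little noise to the copy
X_near = np.column_stack([x1, 2 * x1 + 1e-6 * rng.normal(size=50)])
print(np.linalg.matrix_rank(X_near))           # 2 on paper
print(np.linalg.cond(X_near) > 1e5)            # True: badly conditioned
```

The condition number from `np.linalg.cond` is the usual way to put a number on "shaky in practice": the larger it is, the more input noise gets amplified.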

Step 4 — Matrix × vector is a transformation

A matrix times a vector produces a new vector — rotated, scaled, or projected into a different space. A linear layer in a network is exactly this: y = W x (plus a bias). Training is the search for the weight matrix W that maps inputs to useful outputs.

Sit with that for a second, because it flips how you see models: the matrices are the model. The weights are matrices that transform vectors, attention is a matrix of scores deciding what to look at, and the embedding table is a matrix mapping token IDs to meaning. When you download "the weights," you're downloading a stack of transformations.

python

W = np.array([[1., 0., 0., 0.],
              [0., 1., 0., 0.]])   # (2, 4): projects 4-D down to 2-D

y = W @ king                       # matrix @ vector
print(y, y.shape)                  # -> a new 2-D vector
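The projection above leaves out the bias. Here is a minimal sketch of a full linear layer, y = W x + b, assuming random untrained weights as stand-ins for learned ones; the helper name `linear` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([0.9, 0.8, 0.1, 0.7])   # the toy "king" vector: 4 inputs
W = rng.normal(size=(3, 4))          # (3, 4): 3 outputs, 4 inputs (random, untrained)
b = np.zeros(3)                      # one bias per output

def linear(x, W, b):
    """y = W x + b: each output is one row of W dotted with x, plus a bias."""
    return W @ x + b

y = linear(x, W, b)
print(y.shape)                       # (3,)

# Output 0 really is just a dot product with row 0 of W
print(np.isclose(y[0], np.dot(W[0], x) + b[0]))   # True
```

Reading each output as "one row of W asking how aligned the input is with it" ties this step back to the dot product.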

Step 5 — Matrix × matrix is many dot products at once

Matrix multiplication is nothing more than every row of the first, dotted with every column of the second. That's why shapes have to line up: an (m × k) times a (k × n) gives an (m × n).

python

A = np.random.rand(2, 3)   # (2, 3)
B = np.random.rand(3, 4)   # (3, 4)
C = A @ B                  # (2, 4): the inner 3s match and drop out
print(C.shape)

Operation	Shapes	Result	Intuition
Dot product	(k) · (k)	scalar	one similarity score
Matrix × vector	(m × k) @ (k)	(m)	transform one vector
Matrix × matrix	(m × k) @ (k × n)	(m × n)	all row·column scores at once
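To check the "row dotted with column" claim without rewriting matmul, pick any single entry of the result and recompute it as a dot product. The sketch also shows what NumPy does when the inner dimensions don't line up. The indices `i, j = 1, 2` are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((2, 3))
B = rng.random((3, 4))
C = A @ B

# Entry (i, j) is row i of A dotted with column j of B
i, j = 1, 2
print(np.isclose(C[i, j], np.dot(A[i, :], B[:, j])))   # True

# Inner dimensions that don't match are rejected
try:
    B @ A            # (3, 4) @ (2, 3): 4 != 2
except ValueError as err:
    print("shape mismatch:", err)
```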

Step 6 — This is attention

Here's the payoff. In self-attention, each token is a vector, and the model derives a query vector and a key vector from it. To decide how much token i should attend to token j, it takes the dot product of token i's query with token j's key: a similarity score. Doing that for every pair at once is a single matrix multiply: scores = Q @ K.T.

python

Q = np.stack([king, queen, apple])   # (3, 4): one query vector per token
K = Q.copy()                          # (3, 4): the keys (toy shortcut: real models learn separate Q and K projections)

scores = Q @ K.T                      # (3, 3): every query vs every key
print(scores.shape)
print(np.round(scores, 2))            # row i = how much token i matches each token
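The raw scores are only the first half of attention. In standard scaled dot-product attention, the scores are divided by the square root of the key dimension, turned into weights with a softmax, and used to take a weighted average of value vectors. The sketch below continues the toy example under loud assumptions: `W_q`, `W_k`, and `W_v` are random stand-ins for learned projection matrices, and `softmax` is a small helper written here, not a library call.

```python
import numpy as np

def softmax(x, axis=-1):
    """Turn each row of scores into positive weights that sum to 1."""
    shifted = x - x.max(axis=axis, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

X = np.array([[0.9, 0.8, 0.1, 0.7],    # king
              [0.9, 0.2, 0.1, 0.8],    # queen
              [0.1, 0.1, 0.9, 0.2]])   # apple

rng = np.random.default_rng(0)
d_k = 4
# Random stand-ins for the learned projection matrices (assumption)
W_q, W_k, W_v = (rng.normal(size=(4, d_k)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # each (3, 4)

scores = Q @ K.T / np.sqrt(d_k)            # (3, 3), scaled
weights = softmax(scores, axis=-1)         # each row sums to 1
output = weights @ V                       # (3, 4): one blended vector per token

print(np.round(weights, 2))
print(output.shape)                        # (3, 4)
```

Every line after the projections is something from this lesson: a matrix multiply for the scores, and another matrix multiply to blend the values.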

Where each idea shows up

Nothing in this lesson is math for math's sake — each idea is doing a specific job in systems you'll build:

Idea	Where you'll meet it
Dot product	Attention scores in Transformers; scoring chunks in vector search / RAG
Cosine similarity	Comparing embeddings — semantic search, dedup, clustering
Matrix × vector	Every linear layer: `y = W x`
Matrix × matrix	Attention over all pairs at once (`Q @ K.T`); batching many inputs together
Linear independence	Spotting redundant features; why correlated inputs destabilize weights
Rank	LoRA fine-tuning, model compression, "how much signal is actually in this matrix?"
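The LoRA row deserves a quick look, since it is rank doing real work. The idea is to leave a big weight matrix frozen and learn an update that is the product of two thin matrices, so the update can never have more than r independent directions. This is a sketch of the arithmetic only, under stated assumptions: the sizes `d`, `k`, `r` are illustrative, both factors are random here so the rank is visible (actual LoRA starts one factor at zero so training begins from the original weights), and the scaling factor LoRA applies to the update is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 512, 8            # illustrative sizes; r is the chosen rank

W = rng.normal(size=(d, k))      # stand-in for a frozen pretrained weight matrix
B = rng.normal(size=(d, r))      # trainable (d, r) factor
A = rng.normal(size=(r, k))      # trainable (r, k) factor

delta_W = B @ A                  # (d, k) update built from two thin matrices
print(delta_W.shape)                      # (512, 512)
print(np.linalg.matrix_rank(delta_W))     # 8: at most r independent directions

full_params = d * k              # 262144
lora_params = r * (d + k)        # 8192
print(full_params, lora_params)

# Applying the adapted layer: W x + B (A x), without ever forming delta_W
x = rng.normal(size=k)
y = W @ x + B @ (A @ x)
print(np.allclose(y, (W + delta_W) @ x))  # True
```

With these sizes the two factors hold 32 times fewer numbers than the full matrix, which is the whole trick.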

Use it

Run the demo to see similarities and an attention-score matrix print out:

bash

python foundations/linear-algebra/vectors.py


Ship it

This lesson produces:

— vectors, cosine similarity, a linear layer, and attention scores in ~30 lines of numpy
— the same, cell by cell, to poke at

Exercises

Make two toy 3-D "word" vectors you think are similar and two that aren't; compute cosine similarity and check your intuition.
Write matmul(A, B) with plain Python loops (no numpy) and verify it matches A @ B.
Given Q and K of shape (5, 8), what shape is Q @ K.T? Compute it and say what a single entry means.
Normalize a batch of vectors to unit length in one numpy expression.

Post-lesson quiz


// question

For A of shape (m × k) and B of shape (k × n), what shape is A @ B?

// question

Cosine similarity differs from the raw dot product because it…

// question

In attention, `scores = Q @ K.T` computes…

// question

v1 = [1,0,0], v2 = [0,1,0], v3 = [2,1,0]. Are the three linearly independent?

// question

Why can LoRA fine-tune a huge weight matrix with so few parameters?

Key terms

Term	What people say	What it actually means
Vector	"a list of numbers"	A point / direction in space; an embedding is one
Dot product	"multiply and add"	One number measuring how aligned two vectors are (similarity)
Cosine similarity	"the angle between them"	Dot product of unit-normalized vectors — direction-only similarity, ignores magnitude
Matrix	"a grid of numbers"	A stack of vectors, or a transformation applied to vectors
Matrix multiplication	"rows times columns"	Every row·column dot product at once; (m × k) @ (k × n) = (m × n)
Linear independence	"they don't overlap"	No vector in the set can be built from the others — each adds a genuinely new direction
Rank	"how many dimensions"	The number of linearly independent directions — how much real information a matrix holds
Linear layer	"a dense / fully-connected layer"	`W x` (+ bias): a learned matrix that transforms the input vector
