Foundations · PyTorch

Transformer LM from Scratch

A modern decoder-only Transformer built from first principles in PyTorch — so the production GenAI stack is never a black box.

BPE · RoPE · SwiGLU · RMSNorm
Pre-norm decoder
CS336 test contracts

Python · PyTorch · NumPy

Personal · self-study

// problem

The problem

It’s easy to wire up LLM SDK calls without understanding what’s underneath. I wanted to reason about the failure modes of the models I ship — not just call them.

// approach

What I built

Implemented the full modern stack from scratch — a BPE tokenizer, rotary position embeddings (RoPE), a SwiGLU feed-forward, RMSNorm, and pre-norm decoder blocks.
Targeted Stanford CS336’s test contracts, so the implementation has to satisfy real interface shapes instead of resting on hand-wavy math.
Worked the math first — backprop derived by hand and checked against numerical gradients — before reaching for autograd.
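The derive-by-hand, then check-numerically workflow can be sketched with RMSNorm as the example. This is a minimal NumPy illustration and not the project's actual code: the function names, the eps value, and the array shapes are assumptions chosen here. The backward pass is the hand-derived formula, and the check compares it with central finite differences.

```python
import numpy as np

def rmsnorm_forward(x, g, eps=1e-5):
    # y = x / sqrt(mean(x^2) + eps) * g, normalizing over the last axis
    r = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / r * g

def rmsnorm_backward(x, g, dy, eps=1e-5):
    # Hand-derived gradients of sum(y * dy) with respect to x and g.
    d = x.shape[-1]
    r = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    dn = dy * g  # gradient flowing into the normalized activations
    dx = dn / r - x * np.sum(dn * x, axis=-1, keepdims=True) / (d * r**3)
    dg = np.sum(dy * x / r, axis=0)  # gain is shared across the batch axis
    return dx, dg

def numerical_grad(f, arr, h=1e-5):
    # Central differences: perturb one entry at a time and re-evaluate f().
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + h
        plus = f()
        arr[idx] = old - h
        minus = f()
        arr[idx] = old  # restore the original value
        grad[idx] = (plus - minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8))   # (batch, d_model), illustrative sizes
g = rng.standard_normal(8)
dy = rng.standard_normal((3, 8))  # fixed upstream gradient

def loss():
    return float(np.sum(rmsnorm_forward(x, g) * dy))

dx, dg = rmsnorm_backward(x, g, dy)
dx_num = numerical_grad(loss, x)
dg_num = numerical_grad(loss, g)
print("max |dx - dx_num|:", np.abs(dx - dx_num).max())
print("max |dg - dg_num|:", np.abs(dg - dg_num).max())
```

Running the check in float64 keeps the finite-difference noise small enough that a real derivation error shows up clearly.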

// decisions

Key technical decisions

Implement before you import

Every primitive is written by hand before any high-level shortcut is allowed — that is the whole point of the exercise.
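As one example of what writing a primitive by hand involves, here is a minimal sketch of BPE training in plain Python. It is an illustration, not the project's tokenizer: it works on characters rather than bytes, the function names are hypothetical, and ties between equally frequent pairs are broken by taking the lexicographically greater pair, which is a convention assumed here.

```python
from collections import Counter

def most_frequent_pair(words):
    # words maps a tuple of symbols to how often that word occurs.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Assumed tie-break: highest count, then lexicographically greater pair.
    return max(pairs, key=lambda p: (pairs[p], p))

def merge_pair(words, pair):
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(words, num_merges):
    merges = []
    for _ in range(num_merges):
        if not any(len(symbols) > 1 for symbols in words):
            break  # nothing left to merge
        pair = most_frequent_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words

# Toy corpus: word (as a tuple of characters) -> frequency.
corpus = {
    tuple("low"): 5,
    tuple("lower"): 2,
    tuple("newest"): 6,
    tuple("widest"): 3,
}
merges, vocab_words = train_bpe(corpus, num_merges=3)
print(merges)
```

A production tokenizer would add byte-level input, pre-tokenization, and incremental pair-count updates. The loop above only shows the core count-and-merge step.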

Modern architecture, not the 2017 original

Chose pre-norm + RMSNorm + SwiGLU + RoPE (Llama-class) over the original post-norm Transformer (LayerNorm, ReLU feed-forward, sinusoidal position encodings), to match what production LLMs in the Llama family actually use.
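To make the pre-norm choice concrete, here is a minimal PyTorch sketch of one decoder block. It is an illustration under assumptions, not the project's code: the class names, layer sizes, and eps default are made up here, and RoPE is left out of the attention to keep the block short. The point is the residual pattern x + sublayer(norm(x)), where the post-norm original computes norm(x + sublayer(x)).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, heads, t, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        # Mask out future positions (strictly above the diagonal).
        future = torch.ones(t, t, device=x.device).triu(diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))
        y = torch.softmax(scores, dim=-1) @ v
        return self.out(y.transpose(1, 2).reshape(b, t, d))

class PreNormBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.norm2 = RMSNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # normalize before the sublayer
        x = x + self.ffn(self.norm2(x))   # residual stream is never normalized
        return x

torch.manual_seed(0)
block = PreNormBlock(d_model=16, n_heads=4, d_ff=32)
x = torch.randn(2, 5, 16)  # (batch, seq_len, d_model)
y = block(x)
print(y.shape)
```

Because the residual stream itself is never normalized, gradients have an identity path through every block, which is the usual argument for pre-norm being easier to train at depth.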

// outcomes

Outcomes

Config fields like rope_theta / rms_norm_eps read as concrete choices, not opaque JSON
A solid mental model for debugging real training and inference
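As a sketch of what those two config fields control, the snippet below shows where rope_theta and rms_norm_eps enter the computation. The function names and the default values (10000.0 and 1e-5) are assumptions for illustration, not values taken from any particular model's config.

```python
import torch

def rope_angles(head_dim, seq_len, rope_theta=10000.0):
    # One frequency per channel pair: rope_theta^(-2i / head_dim).
    # head_dim is assumed to be even.
    exponents = torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim
    inv_freq = 1.0 / (rope_theta ** exponents)
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(positions, inv_freq)  # (seq_len, head_dim / 2)

def apply_rope(x, angles):
    # x: (..., seq_len, head_dim). Rotate each (even, odd) channel pair
    # by the angle for its position and frequency.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)  # re-interleave the pairs

def rms_norm(x, weight, rms_norm_eps=1e-5):
    # rms_norm_eps keeps the denominator away from zero.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + rms_norm_eps)
    return x / rms * weight

torch.manual_seed(0)
head_dim, seq_len = 8, 8
angles = rope_angles(head_dim, seq_len)

# Put the same query and key vector at every position, then rotate.
q = torch.randn(head_dim).repeat(seq_len, 1)
k = torch.randn(head_dim).repeat(seq_len, 1)
scores = apply_rope(q, angles) @ apply_rope(k, angles).T

# Scores depend only on the relative offset between positions.
print(scores[3, 1].item(), scores[5, 3].item())

# An all-zero row stays finite with eps, and becomes NaN without it.
zero_row = torch.zeros(4)
print(rms_norm(zero_row, torch.ones(4)))
print(rms_norm(zero_row, torch.ones(4), rms_norm_eps=0.0))
```

A larger rope_theta lowers the rotation frequencies, so positions far apart stay distinguishable over longer contexts. rms_norm_eps only matters when activations are very small, which is exactly when a missing or mis-set value produces NaNs.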

Want to talk through any of this?

jntkhandebharad@gmail.com