Foundations · PyTorch
Transformer LM from Scratch
A modern decoder-only Transformer built from first principles in PyTorch — so the production GenAI stack is never a black box.
- BPE · RoPE · SwiGLU · RMSNorm
- Pre-norm decoder
- CS336 test contracts
Python · PyTorch · NumPy
Personal · self-study
// problem
The problem
It’s easy to wire up LLM SDK calls without understanding what’s underneath. I wanted to reason about the failure modes of the models I ship — not just call them.
// approach
What I built
- Implemented the full modern stack from scratch — a BPE tokenizer, rotary position embeddings (RoPE), a SwiGLU feed-forward, RMSNorm, and pre-norm decoder blocks.
- Targeted Stanford CS336’s test contracts, so each module has to match the function signatures and tensor shapes the tests expect, rather than only looking right on paper.
- Worked the math first — backprop derived by hand and checked against numerical gradients — before reaching for autograd.
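The gradient check in the last bullet can be sketched in NumPy. This is an illustrative example, not the project's code: it assumes RMSNorm as the layer under test, and the function names (`rmsnorm_forward`, `rmsnorm_backward`, `numerical_grad`) and the random upstream gradient `dy` are hypothetical. The hand-derived backward pass is compared against central differences.

```python
import numpy as np

def rmsnorm_forward(x, gain, eps=1e-5):
    """y = gain * x / sqrt(mean(x^2) + eps), normalizing over the last axis."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms, rms

def rmsnorm_backward(dy, x, gain, rms):
    """Hand-derived gradients of sum(dy * y) with respect to x and gain."""
    d = x.shape[-1]
    dyg = dy * gain
    # Direct path through the numerator, minus the path through the shared rms term.
    dx = dyg / rms - x * np.sum(dyg * x, axis=-1, keepdims=True) / (d * rms**3)
    dgain = np.sum(dy * x / rms, axis=0)
    return dx, dgain

def numerical_grad(f, arr, h=1e-6):
    """Central-difference gradient of the scalar f() w.r.t. arr (perturbed in place)."""
    grad = np.zeros_like(arr)
    for idx in np.ndindex(*arr.shape):
        old = arr[idx]
        arr[idx] = old + h
        plus = f()
        arr[idx] = old - h
        minus = f()
        arr[idx] = old  # restore the original value
        grad[idx] = (plus - minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))      # float64 keeps finite differences accurate
gain = rng.standard_normal(8)
dy = rng.standard_normal((4, 8))     # stand-in for the upstream gradient

def loss():
    return float(np.sum(dy * rmsnorm_forward(x, gain)[0]))

_, rms = rmsnorm_forward(x, gain)
dx, dgain = rmsnorm_backward(dy, x, gain, rms)
max_err_x = np.max(np.abs(dx - numerical_grad(loss, x)))
max_err_gain = np.max(np.abs(dgain - numerical_grad(loss, gain)))
print(max_err_x, max_err_gain)
```

If the analytic and numerical gradients disagree by more than a small tolerance, the derivation (or its transcription into code) is the first suspect, which is exactly the kind of error autograd would otherwise hide.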
// decisions
Key technical decisions
Implement before you import
Every primitive is written by hand before any high-level shortcut is allowed — that is the whole point of the exercise.
Modern architecture, not the 2017 original
Chose pre-norm + RMSNorm + SwiGLU + RoPE (Llama-class) over the original post-norm Transformer, to match what production LLMs actually use.
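As a rough illustration of what "pre-norm + RMSNorm + SwiGLU" means structurally, here is a minimal PyTorch sketch. It is an assumption-laden example rather than the project's implementation: the class names, the fused `qkv` projection, and the default `eps` are hypothetical, and RoPE is left out to keep the block short. The pre-norm part is the residual wiring: each sub-layer sees a normalized input, while the residual stream itself is never normalized (the 2017 post-norm design instead applies the norm after the residual add).

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        # Scale by the root mean square; no mean subtraction, no bias.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (batch, time, d_model) -> (batch, heads, time, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        # Mask out future positions so token i only attends to tokens <= i.
        future = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
        y = torch.softmax(scores, dim=-1) @ v
        return self.out(y.transpose(1, 2).reshape(b, t, d))

class PreNormBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.norm2 = RMSNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))    # normalize the sub-layer input only
        return x + self.ffn(self.norm2(x))  # residual stream stays un-normalized

torch.manual_seed(0)
block = PreNormBlock(d_model=32, n_heads=4, d_ff=64)
x = torch.randn(2, 5, 32)
y = block(x)
print(y.shape)
```

A quick sanity check for a block like this is causality: perturbing the last token should leave the outputs at all earlier positions unchanged.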
// outcomes
Outcomes
- Config fields such as rope_theta and rms_norm_eps now read as concrete design choices rather than opaque JSON keys
- A solid mental model for debugging real training and inference
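To make the two config fields named above concrete, here is a small pure-Python sketch of what each one controls. It is illustrative only: the helper names are hypothetical, the default values are common choices rather than anything taken from this project, and pairing adjacent dimensions is just one RoPE convention (some implementations pair dimension i with i + d/2 instead). `rope_theta` sets the base of the rotation frequencies, and `rms_norm_eps` is the small constant that keeps the RMSNorm denominator away from zero.

```python
import math

def rope_inv_freq(head_dim, rope_theta=10000.0):
    """One rotation frequency per pair of dimensions: theta^(-2i/d)."""
    return [rope_theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, position, rope_theta=10000.0):
    """Rotate each adjacent (even, odd) pair of vec by position * frequency."""
    out = []
    for i, freq in enumerate(rope_inv_freq(len(vec), rope_theta)):
        angle = position * freq
        a, b = vec[2 * i], vec[2 * i + 1]
        out += [a * math.cos(angle) - b * math.sin(angle),
                a * math.sin(angle) + b * math.cos(angle)]
    return out

def longest_wavelength(head_dim, rope_theta):
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    return 2 * math.pi / min(rope_inv_freq(head_dim, rope_theta))

def rms_norm(vec, rms_norm_eps=1e-5):
    """RMSNorm without a learned gain; eps guards against a zero denominator."""
    mean_sq = sum(v * v for v in vec) / len(vec)
    return [v / math.sqrt(mean_sq + rms_norm_eps) for v in vec]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.1, -0.5, 0.9, 1.1, -0.4]
k = [1.0, 0.2, -0.6, 0.8, 0.4, -0.3, 0.5, 0.7]

# Attention scores after RoPE depend only on the offset between positions.
score_near = dot(apply_rope(q, 3), apply_rope(k, 1))
score_far = dot(apply_rope(q, 12), apply_rope(k, 10))
print(abs(score_near - score_far))

# A larger rope_theta stretches the slowest rotation over more positions.
print(longest_wavelength(64, 10000.0), longest_wavelength(64, 500000.0))

# An all-zero input stays finite because of rms_norm_eps.
print(rms_norm([0.0, 0.0, 0.0, 0.0]))
```

Read this way, raising `rope_theta` is a statement about how slowly the lowest-frequency dimensions rotate across the context window, and `rms_norm_eps` is a numerical-stability choice that matters most when activations are tiny or computed in low precision.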
Want to talk through any of this?
jntkhandebharad@gmail.com