Transformer LM from Scratch

A modern decoder-only Transformer built from first principles in PyTorch — so the production GenAI stack is never a black box.

  • BPE · RoPE · SwiGLU · RMSNorm
  • Pre-norm decoder
  • CS336 test contracts
Python · PyTorch · NumPy

Personal · self-study

The problem

It’s easy to wire up LLM SDK calls without understanding what’s underneath. I wanted to reason about the failure modes of the models I ship — not just call them.

What I built

  • Implemented the full modern stack from scratch — a BPE tokenizer, rotary position embeddings (RoPE), a SwiGLU feed-forward, RMSNorm, and pre-norm decoder blocks.
  • Built against Stanford CS336’s test contracts, so every component has to satisfy externally defined interfaces instead of only looking right on paper.
  • Worked the math first — backprop derived by hand and checked against numerical gradients — before reaching for autograd.
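To illustrate the "derive by hand, then check numerically" step, here is a minimal NumPy sketch for one primitive, RMSNorm. It is not the project's code: the function names, the default eps of 1e-5, and the choice of RMSNorm as the example are all assumptions. With y = g * x / r and r = sqrt(mean(x^2) + eps), the hand-derived input gradient is dL/dx = (g * dy) / r - x * sum(g * dy * x) / (d * r^3), and a central-difference estimate should agree with it.

```python
import numpy as np

def rmsnorm_forward(x, g, eps=1e-5):
    # y = g * x / r, with r = sqrt(mean(x^2) + eps) over the last axis
    r = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return g * x / r

def rmsnorm_backward_x(x, g, dy, eps=1e-5):
    # Hand-derived gradient of sum(y * dy) with respect to x
    d = x.shape[-1]
    r = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    gdy = g * dy
    return gdy / r - x * np.sum(gdy * x, axis=-1, keepdims=True) / (d * r**3)

def numerical_grad_x(x, g, dy, h=1e-6, eps=1e-5):
    # Central differences, one input element at a time (slow, but simple)
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        xp = x.copy()
        xm = x.copy()
        xp[idx] += h
        xm[idx] -= h
        loss_plus = np.sum(rmsnorm_forward(xp, g, eps) * dy)
        loss_minus = np.sum(rmsnorm_forward(xm, g, eps) * dy)
        grad[idx] = (loss_plus - loss_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))   # (tokens, d_model), float64 for a tight check
g = rng.normal(size=(8,))     # learned gain
dy = rng.normal(size=(3, 8))  # pretend upstream gradient

analytic = rmsnorm_backward_x(x, g, dy)
numeric = numerical_grad_x(x, g, dy)
max_abs_err = float(np.max(np.abs(analytic - numeric)))
print("max abs error:", max_abs_err)
```

Running the check in float64 keeps finite-difference noise small; in float32 the tolerance would need to be much looser.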

Key technical decisions

Implement before you import

Every primitive is written by hand before any high-level shortcut is allowed — that is the whole point of the exercise.

Modern architecture, not the 2017 original

Chose pre-norm + RMSNorm + SwiGLU + RoPE (Llama-class) over the original post-norm Transformer, to match what production LLMs actually use.
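As a rough PyTorch sketch of what "pre-norm + RMSNorm + SwiGLU" means structurally: each sub-layer normalizes its input and adds the result back to the residual stream, instead of normalizing after the addition as in the 2017 post-norm layout. Class names, layer names (w1, w2, w3), and the sizes below are illustrative assumptions, not the project's actual interfaces, and RoPE is left out of the attention here to keep the block short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class CausalSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        b, t, d = x.shape
        qkv = self.qkv(x).view(b, t, 3, self.n_heads, d // self.n_heads)
        q, k, v = (z.transpose(1, 2) for z in qkv.unbind(dim=2))  # (b, heads, t, head_dim)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        # True above the diagonal marks future positions, which are masked out
        future = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
        y = torch.softmax(scores, dim=-1) @ v
        return self.out(y.transpose(1, 2).reshape(b, t, d))

class PreNormBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.norm2 = RMSNorm(d_model)
        self.ffn = SwiGLU(d_model, d_ff)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))    # pre-norm: normalize, transform, add
        return x + self.ffn(self.norm2(x))  # post-norm would be norm(x + sublayer(x))

torch.manual_seed(0)
block = PreNormBlock(d_model=16, n_heads=4, d_ff=32)
x = torch.randn(2, 5, 16)  # (batch, seq_len, d_model)
y = block(x)
print(y.shape)
```

Because the residual path is never normalized, the input passes through the block as an identity plus corrections, which is the usual argument for pre-norm being easier to train at depth.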

Outcomes

  • Config fields like rope_theta / rms_norm_eps read as concrete choices, not opaque JSON
  • A solid mental model for debugging real training and inference
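To make the two config fields concrete, the NumPy sketch below shows where each one could enter the computation: rope_theta sets the base of the rotation frequencies in RoPE, and rms_norm_eps is the small constant under the square root in RMSNorm. The function names, the default values (10000.0 and 1e-5), and the interleaved pairing of dimensions are assumptions for illustration; real implementations differ in layout and defaults.

```python
import numpy as np

def rope_angles(seq_len, head_dim, rope_theta=10000.0):
    # One frequency per dimension pair: rope_theta ** (-2i / head_dim)
    inv_freq = rope_theta ** (-np.arange(0, head_dim, 2) / head_dim)
    positions = np.arange(seq_len)
    return np.outer(positions, inv_freq)  # (seq_len, head_dim // 2)

def apply_rope(x, rope_theta=10000.0):
    # x: (seq_len, head_dim) with even head_dim; rotates pairs (x[2i], x[2i+1])
    seq_len, head_dim = x.shape
    ang = rope_angles(seq_len, head_dim, rope_theta)
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

def rms_norm(x, weight, rms_norm_eps=1e-5):
    # eps keeps the denominator away from zero for all-zero inputs
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + rms_norm_eps)
    return weight * x / rms

rng = np.random.default_rng(0)
q_vec = rng.normal(size=8)
k_vec = rng.normal(size=8)
q = apply_rope(np.tile(q_vec, (6, 1)))  # same query content at 6 positions
k = apply_rope(np.tile(k_vec, (6, 1)))  # same key content at 6 positions
scores = q @ k.T                        # scores[m, n] depends only on m - n

zeros_normed = rms_norm(np.zeros((1, 8)), np.ones(8))
```

Raising rope_theta lowers every non-zero rotation frequency, so the position-dependent phases vary more slowly across the sequence. Setting rms_norm_eps to zero would make the all-zero input above divide by zero.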

Want to talk through any of this?

jntkhandebharad@gmail.com