
zigllm vs. nanoGPT

Karpathy's nanoGPT is the canonical pedagogical GPT in Python. zigllm follows the same impulse, but in Zig, with manual memory management, and across 18 architecture families instead of one. They are complementary readings, not competitors.

Verdict

Read nanoGPT first to see the math without distractions. Read zigllm next to see what the math looks like once you take the runtime away.


If you want to understand how a transformer works by reading code, you almost certainly start with Karpathy’s nanoGPT. It is the canonical pedagogical GPT implementation: a few hundred lines of PyTorch that train and generate from a GPT-style model, written with the explicit goal of being legible.

zigllm is the same impulse, with two differences: it is written in Zig instead of Python, and it covers eighteen architecture families instead of one. The two projects are best read together, in that order.

What nanoGPT does well

nanoGPT is unmatched as a first read. PyTorch hides all of the bookkeeping. A tensor is a tensor. A backward pass is .backward(). A matmul is @. This is enormously useful when the thing you are trying to learn is the mathematical structure of a transformer — the attention formula, the residual stream, the way embeddings flow through layers and turn into logits.

You don’t want to be reading about strides when you are first trying to understand causal attention. nanoGPT respects that. It also includes training, which zigllm does not — Karpathy walks you through pretraining a small GPT on a corpus, which is the part of the lifecycle that zigllm deliberately punts on.

If you do not yet have an intuition for what self-attention computes, read nanoGPT first.
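
For a sense of what that intuition looks like in code, here is a minimal single-head sketch of causal self-attention in plain Python. It is not taken from nanoGPT, whose version is batched, multi-head PyTorch; the function names and the list-of-lists layout are my own assumptions, chosen only to keep the arithmetic visible.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head causal attention (illustrative names, not nanoGPT's API).

    q, k, v are lists of T vectors, one per position.
    Returns T output vectors.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Causality: position t only scores against positions 0..t.
        scores = [
            sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        out.append([
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ])
    return out
```

With `q = k = v = [[1.0, 0.0], [0.0, 1.0]]`, the first position can only see itself, so its output is exactly its own value vector; the second position blends both.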

What zigllm does that nanoGPT doesn’t

Once you have the math, the next thing you want to know is what it costs. That is where nanoGPT, by virtue of being PyTorch, abstracts too much. The matmul is @. You do not see the SIMD lanes. You do not see the stride traversal. You do not see how the weight tensor is laid out in memory. PyTorch’s tensor library and its BLAS and CUDA backends handle all of it.

zigllm strips those abstractions away. The matmul is a kernel you can read. The tensor is a struct with a flat buffer, a shape, strides, and a dtype. There is no autograd because there is no training: the project is inference-only, and that focus lets it spend more pages per concept on the part of the stack that matters at runtime.
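
To make the "flat buffer plus shape plus stride" idea concrete, here is a small sketch in Python rather than Zig. It is not zigllm's code: `Tensor2D`, `row_major`, and `matmul` are hypothetical names, the dtype field is omitted because the buffer holds Python floats, and the kernel is the naive triple loop with no SIMD.

```python
from dataclasses import dataclass


@dataclass
class Tensor2D:
    """Hypothetical 2-D tensor: a flat buffer plus the metadata to index it."""
    data: list      # flat buffer of numbers
    shape: tuple    # (rows, cols)
    strides: tuple  # buffer elements to skip per step along each axis

    def at(self, i, j):
        # Every element access is just an offset computation.
        return self.data[i * self.strides[0] + j * self.strides[1]]

    def transpose(self):
        # No copy: swap shape and strides, keep the same buffer.
        return Tensor2D(self.data, self.shape[::-1], self.strides[::-1])


def row_major(rows, cols, values):
    """Build a contiguous row-major tensor from an iterable of values."""
    values = list(values)
    assert len(values) == rows * cols
    return Tensor2D(values, (rows, cols), (cols, 1))


def matmul(a, b):
    """Naive matmul kernel that respects the strides of both inputs."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a.at(i, p) * b.at(p, j)
            out[i * n + j] = acc
    return Tensor2D(out, (m, n), (n, 1))
```

The point of the sketch is that a transpose costs nothing until something reads through it: `matmul(a, a.transpose())` walks the same buffer twice with two different stride patterns, which is exactly the kind of cost `@` hides.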

It also covers eighteen architecture families. nanoGPT teaches you GPT. zigllm shows you how LLaMA differs from GPT-2 (RMSNorm vs LayerNorm, RoPE vs learned position embeddings, SwiGLU vs GELU), how Mistral differs from LLaMA (sliding-window attention), how Mamba differs from all of them (state-space, not attention at all). The comparative reading is its own pedagogy.
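
Two of those differences fit in a few lines each. The sketch below is my own illustration in plain Python, not code from zigllm or from any of the models named: it contrasts LayerNorm with RMSNorm on a single vector, and full causal visibility with a sliding window. The function names, the `eps` default, and the convention that the window size counts the current token are all assumptions.

```python
import math


def layer_norm(x, gain, bias, eps=1e-5):
    """LayerNorm: subtract the mean, divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv * g + b for v, g, b in zip(x, gain, bias)]


def rms_norm(x, gain, eps=1e-5):
    """RMSNorm: no mean subtraction and no bias, only a rescale by the RMS."""
    mean_square = sum(v * v for v in x) / len(x)
    inv = 1.0 / math.sqrt(mean_square + eps)
    return [v * inv * g for v, g in zip(x, gain)]


def visible_positions(t, window=None):
    """Key positions that query position t may attend to.

    window=None gives full causal attention (0..t). With a window, only the
    most recent `window` positions are visible, counting position t itself
    (an assumed convention for this sketch).
    """
    start = 0 if window is None else max(0, t - window + 1)
    return list(range(start, t + 1))
```

RMSNorm drops one pass over the data and one parameter vector relative to LayerNorm, and the sliding window caps how many keys each query touches; both are the kind of structural difference that only shows up when the implementations sit side by side.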

Where they overlap and where they don’t

	nanoGPT	zigllm
Language	Python (PyTorch)	Zig
Scope	Pretraining + inference, one architecture	Inference only, 18 architecture families
Tensors	`torch.Tensor`	Hand-rolled, shape + stride + dtype
Matmul	`@` (delegated to BLAS / CUDA)	Hand-written kernels with SIMD intrinsics
Quantization	Not the focus	Central — 18+ formats, K- and IQ-quant
Memory	Managed	Manual, with arenas
Sampling	A handful of methods	Full set including Mirostat, grammar-constrained
Goal	“Train a small GPT yourself”	“See the inference stack with nothing hidden”
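
The quantization row is easier to picture with the simplest possible block scheme: store one float scale per block and one small integer per weight. The sketch below is a generic absmax 8-bit quantizer written for illustration. It is not a K-quant or an IQ-quant, which are more elaborate, and it is not zigllm's implementation; the function names are hypothetical.

```python
def quantize_block(values):
    """Quantize one block of floats to int8-range codes plus one float scale."""
    amax = max(abs(v) for v in values)
    # Map the largest magnitude in the block to 127; avoid dividing by zero.
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats: one multiply per weight."""
    return [scale * c for c in codes]
```

Each weight now costs one byte plus its share of the block's scale, and the round-trip error per weight is bounded by half the scale. That trade between storage and error is what the more sophisticated formats refine.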

The complementary reading order

The way I would suggest reading them, if you have time for both:

1. nanoGPT first, end to end. Don’t worry about performance. Worry about the math. Run the training script on Shakespeare. Watch loss go down. Read the model in `model.py` and make sure you can draw a transformer block on a napkin.
2. zigllm Layers 1–2, to convert your “tensor = a thing that holds numbers” intuition into “tensor = a flat buffer with shape, stride, and dtype.” This is the conceptual move that nanoGPT does not force you to make.
3. zigllm Layers 3–4, to see RMSNorm, RoPE, and SwiGLU as explicit functions of tensors. You will already know what attention computes from nanoGPT; here you learn what it costs structurally.
4. zigllm Layer 5, to see eighteen architecture families side by side. This is the comparative anatomy lab.
5. zigllm Layer 6, to see what nanoGPT’s `model.generate()` actually does, expanded into the sampler zoo and the KV cache.
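
As a preview of that last step, here is a minimal sketch of the core of a sampler: temperature scaling with an optional top-k cutoff. It is written in plain Python for illustration and is not copied from nanoGPT or zigllm; `sample_next` and its parameters are assumed names, and samplers such as Mirostat or grammar-constrained decoding add considerably more machinery on top.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Pick the next token id from a list of logits (illustrative sketch)."""
    if temperature <= 0:
        # Treat non-positive temperature as greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Candidate ids, most likely first; optionally keep only the top k.
    ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ids = ids[:top_k]
    # Temperature rescales logits before the softmax; subtracting the max
    # keeps exp() from overflowing. Unnormalized weights are enough here.
    scaled = [logits[i] / temperature for i in ids]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(ids, weights=weights, k=1)[0]
```

Passing `rng=random.Random(0)` makes the draws reproducible. A generation loop calls something like this once per new token, which is why the KV cache matters: without it, every call would recompute attention over the whole prefix.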

By the end of this sequence you will have learned the same transformer twice — once with the runtime helping you, once with the runtime out of the way. Both perspectives are necessary.

Final note

nanoGPT and zigllm are not competing projects. They are different sections of the same imaginary textbook. nanoGPT is the chapter on what a transformer computes. zigllm is the chapter on how that computation is structured in memory and on a CPU. Read them in order and you will know more about transformers than most people who only ever use them through an API.
