zigllm vs. llm.c
Karpathy's llm.c is the C-from-scratch counterpart to nanoGPT — pure C, no dependencies, training included. zigllm is its Zig cousin, inference-only, with more architecture families and SIMD-first kernels.
Read llm.c if you want to see training-from-scratch in raw C. Read zigllm if you want to see inference-from-scratch in Zig with manual memory and quantization.
llm.c is the natural counterpart to zigllm. Both projects share an aesthetic: no framework, no runtime, the math is visible. The difference is which half of the model lifecycle they care about, and which language they reach for.
This page exists because if you found your way to zigllm, you very likely also found your way to llm.c. Here is how I think they relate.
What llm.c does
llm.c is GPT-2 training and inference written in pure C, with optional CUDA, and essentially
zero dependencies. The headline feat is that the entire forward pass, backward pass, optimizer,
and training loop fit in a single file you can read in a sitting. There is no PyTorch, no
TensorFlow, no torch.nn.Linear. There are arrays and there is C.
This is enormously instructive for the same reason zigllm is: it strips away the abstractions that the framework would normally provide. When you read llm.c, you see exactly how the gradient of a layer norm is computed, how the optimizer updates parameters, how the data loader shuffles batches. The training loop is not a black box. It is a function you can step through.
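To make "a function you can step through" concrete, here is a toy sketch in C. It is not taken from llm.c. It trains a single linear layer with a squared-error loss and plain SGD, which is a deliberately simpler stand-in for the GPT-2 forward pass, backward pass, and optimizer described above. Every function name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: y = W x, with W stored row-major as out_dim x in_dim.
   Hypothetical names throughout; this is an illustration, not llm.c code. */

/* Forward pass: y[o] = sum_i W[o][i] * x[i]. */
void linear_forward(float *y, const float *w, const float *x,
                    size_t out_dim, size_t in_dim) {
    for (size_t o = 0; o < out_dim; o++) {
        float acc = 0.0f;
        for (size_t i = 0; i < in_dim; i++) {
            acc += w[o * in_dim + i] * x[i];
        }
        y[o] = acc;
    }
}

/* Loss L = 0.5 * sum_o (y[o] - target[o])^2.
   Also writes the gradient dL/dy = y - target into dy. */
float mse_loss_and_grad(float *dy, const float *y, const float *target,
                        size_t out_dim) {
    float loss = 0.0f;
    for (size_t o = 0; o < out_dim; o++) {
        float diff = y[o] - target[o];
        dy[o] = diff;
        loss += 0.5f * diff * diff;
    }
    return loss;
}

/* Backward pass for the weights: dL/dW[o][i] = dy[o] * x[i]. */
void linear_backward(float *dw, const float *dy, const float *x,
                     size_t out_dim, size_t in_dim) {
    for (size_t o = 0; o < out_dim; o++) {
        for (size_t i = 0; i < in_dim; i++) {
            dw[o * in_dim + i] = dy[o] * x[i];
        }
    }
}

/* Plain SGD update (swapped in for brevity): w -= lr * dw. */
void sgd_update(float *w, const float *dw, size_t n, float lr) {
    for (size_t i = 0; i < n; i++) {
        w[i] -= lr * dw[i];
    }
}

/* One full training step. Returns the loss measured before the update.
   y, dy, and dw are caller-provided scratch buffers. */
float train_step(float *w, const float *x, const float *target,
                 float *y, float *dy, float *dw,
                 size_t out_dim, size_t in_dim, float lr) {
    assert(w && x && target && y && dy && dw);
    linear_forward(y, w, x, out_dim, in_dim);
    float loss = mse_loss_and_grad(dy, y, target, out_dim);
    linear_backward(dw, dy, x, out_dim, in_dim);
    sgd_update(w, dw, out_dim * in_dim, lr);
    return loss;
}
```

Calling train_step repeatedly on the same input and target should drive the returned loss toward zero, provided the learning rate is small enough for the input's magnitude. The point of the sketch is only the shape: forward, loss, backward, update, each a loop over arrays.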
llm.c also includes CUDA kernels, which zigllm does not. If your goal is to understand GPU-accelerated training of a small GPT, llm.c is the one of the two to read; zigllm is CPU only.
What zigllm does differently
zigllm makes three deliberate choices that diverge from llm.c.
Inference only. zigllm does not implement training. There is no backward pass, no optimizer, no gradient checkpointing. This is a focus, not an oversight. The project’s claim is that inference is interesting enough — and structurally distinct enough — to deserve a whole codebase to itself. KV caching, quantization, streaming, grammar-constrained sampling: these only exist on the inference side, and they are where most of the interesting engineering of modern LLM serving lives.
Eighteen architecture families instead of one. llm.c is GPT-2. zigllm is GPT-2 and LLaMA and Mistral and Mamba and BERT and thirteen others. The comparative angle, seeing what changes from GPT-2 to LLaMA to Mistral, is a different educational structure than “read one model deeply.”
Zig, not C. This sounds like a small change, and at the kernel level it almost is —
zigllm’s matmul kernels would be recognizable to a C programmer. But Zig gives you comptime,
which means the tensor type can be generic over dtype without runtime dispatch. It gives you
first-class SIMD intrinsics, which means the SIMD kernel reads like a vectorized loop instead
of a wall of _mm256_* calls. It gives you defer-style cleanup, which makes manual memory
discipline less painful than in C. The kernels are not radically different; the reading
experience is.
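Of the inference-side mechanisms named under the first choice above, KV caching is the easiest to sketch. The block below is a minimal illustration in C, not zigllm's implementation: one attention head, one layer, a fixed capacity, and hypothetical names and sizes throughout. It shows the core idea only: each new token appends one key row and one value row, and a new query is scored against the rows already stored instead of recomputing them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KV_MAX_SEQ 8   /* hypothetical capacity, tiny for illustration */
#define KV_HEAD_DIM 4  /* hypothetical head dimension */

/* Cache for one attention head in one layer. */
typedef struct {
    float keys[KV_MAX_SEQ][KV_HEAD_DIM];
    float values[KV_MAX_SEQ][KV_HEAD_DIM];
    size_t len; /* number of positions filled so far */
} KvCache;

/* Reset the cache to empty. */
void kv_init(KvCache *c) {
    memset(c, 0, sizeof *c);
}

/* Store the key and value vectors of the newest token.
   Returns the position the token was stored at. */
size_t kv_append(KvCache *c, const float *k, const float *v) {
    assert(c->len < KV_MAX_SEQ);
    memcpy(c->keys[c->len], k, sizeof c->keys[0]);
    memcpy(c->values[c->len], v, sizeof c->values[0]);
    return c->len++;
}

/* Unnormalized attention scores of one query against every cached key.
   Scaling, softmax, and the weighted sum over values are omitted
   to keep the sketch short. scores must hold at least c->len floats. */
void kv_scores(const KvCache *c, const float *q, float *scores) {
    for (size_t t = 0; t < c->len; t++) {
        float dot = 0.0f;
        for (size_t d = 0; d < KV_HEAD_DIM; d++) {
            dot += q[d] * c->keys[t][d];
        }
        scores[t] = dot;
    }
}
```

A real cache would be sized per layer and per head, allocated at run time, and would have a policy for what happens when the context fills up. None of that is shown here.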
Side by side
| | llm.c | zigllm |
|---|---|---|
| Language | C (+ CUDA) | Zig |
| Scope | Training + inference, GPT-2 | Inference, 18 architectures |
| Dependencies | None | None |
| Single-file? | Roughly | No — organized by layer |
| SIMD | Manual / via compiler | First-class intrinsics, explicit |
| Quantization | Not the focus | Central — 18+ formats |
| GPU | Yes (CUDA) | No, CPU only |
| Test count | Comparatively few | 285+ |
Where they meet
Both projects make the same fundamental wager: that you learn more from reading a small, honest implementation than from reading the same algorithm wrapped in three layers of framework. Both are written in the kind of low-level language that forces you to think about memory. Both treat the matmul as something you implement, not something you call.
If you read llm.c and then read zigllm, you will have seen the same fundamental object — a transformer — from two adjacent languages and two adjacent vantage points. The training perspective in llm.c teaches you what the weights mean (the result of an optimization process). The inference perspective in zigllm teaches you what the weights cost (memory layout, kernel access patterns, quantization tricks).
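As one concrete reading of "what the weights cost", here is a minimal sketch in C of block-wise 8-bit quantization: each block of weights shares a single float scale and stores one signed byte per weight. This is a generic scheme written for illustration. The block size, struct layout, and function names are assumptions, and nothing here is claimed to match any of zigllm's actual formats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QBLOCK 32 /* hypothetical block size: weights that share one scale */

typedef struct {
    float scale;      /* dequantized value = scale * q[i] */
    int8_t q[QBLOCK]; /* one signed byte per weight */
} BlockQ8;

static float abs_f(float x) {
    return x < 0.0f ? -x : x;
}

/* Round to nearest, halves away from zero.
   Callers only pass values within about [-127, 127]. */
static int8_t round_to_i8(float x) {
    return (int8_t)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}

/* Quantize n floats into n / QBLOCK blocks. n must be a multiple of QBLOCK. */
void quantize_q8(BlockQ8 *out, const float *x, size_t n) {
    assert(n % QBLOCK == 0);
    for (size_t b = 0; b < n / QBLOCK; b++) {
        const float *src = x + b * QBLOCK;
        /* The largest magnitude in the block maps to +/-127. */
        float amax = 0.0f;
        for (size_t i = 0; i < QBLOCK; i++) {
            float a = abs_f(src[i]);
            if (a > amax) amax = a;
        }
        float scale = amax / 127.0f;
        float inv = scale > 0.0f ? 1.0f / scale : 0.0f; /* all-zero block */
        out[b].scale = scale;
        for (size_t i = 0; i < QBLOCK; i++) {
            out[b].q[i] = round_to_i8(src[i] * inv);
        }
    }
}

/* Reconstruct approximate floats from the quantized blocks. */
void dequantize_q8(float *x, const BlockQ8 *in, size_t n) {
    assert(n % QBLOCK == 0);
    for (size_t b = 0; b < n / QBLOCK; b++) {
        for (size_t i = 0; i < QBLOCK; i++) {
            x[b * QBLOCK + i] = in[b].scale * (float)in[b].q[i];
        }
    }
}
```

With this layout, 32 weights take 32 bytes plus one 4-byte scale instead of 128 bytes as 32-bit floats, and each reconstructed weight is off by at most about half a scale step. That trade between bytes and error is the kind of cost the inference side makes visible.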
The complementary reading order
If you are doing both:
- llm.c first, to see training. Read train_gpt2.c end to end. Don’t be afraid of the C.
- zigllm Layer 1–2, to convert your training-time tensor intuition into an inference-time one. The shapes are familiar; the kernels are not.
- zigllm Layer 5, to expand your single-architecture intuition (GPT-2 in llm.c) into a comparative one. LLaMA, Mistral, Mamba — what changes, what doesn’t.
- zigllm Layer 6, to see what inference looks like as a serious system: KV cache, sampler zoo, grammar constraints.
llm.c is the “I trained a GPT-2 from scratch” story. zigllm is the “I ran an LLM with no hidden machinery” story. The intersection is the matmul. The differences are the parts of the LLM lifecycle each project chose to make legible.
Final note
llm.c and zigllm are both expressions of the same philosophy: low-level languages used to make machine learning legible. They differ on language, on lifecycle, and on scope. If you only read one, read the one that matches the half of the lifecycle you care about. If you read both, read them in the order above — and you’ll find that the two implementations explain each other.