zigllm vs. llm.c

Karpathy's llm.c is the C-from-scratch counterpart to nanoGPT — pure C, no dependencies, training included. zigllm is its Zig cousin, inference-only, with more architecture families and SIMD-first kernels.

Verdict

Read llm.c if you want to see training-from-scratch in raw C. Read zigllm if you want to see inference-from-scratch in Zig with manual memory and quantization.

llm.c is the natural counterpart to zigllm. Both projects share an aesthetic: no framework, no runtime, the math is visible. The difference is which half of the model lifecycle they care about, and which language they reach for.

This page exists because if you found your way to zigllm, you very likely also found your way to llm.c. Here is how I think they relate.

What llm.c does

llm.c is GPT-2 training and inference written in pure C, with optional CUDA, and essentially zero dependencies. The headline feat is that the entire forward pass, backward pass, optimizer, and training loop fit in a single file you can read in a sitting. There is no PyTorch, no TensorFlow, no torch.nn.Linear. There are arrays and there is C.

This is enormously instructive for the same reason zigllm is: it strips away the abstractions that the framework would normally provide. When you read llm.c, you see exactly how the gradient of a layer norm is computed, how the optimizer updates parameters, how the data loader shuffles batches. The training loop is not a black box. It is a function you can step through.
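To make "arrays and C" concrete, here is a minimal sketch of a linear layer's forward pass over plain float arrays. It is not copied from llm.c: the name linear_forward, the argument order, and the row-major layout are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not llm.c's code): out = inp * weight^T + bias.
 *   inp:    (rows, in_dim),    row-major
 *   weight: (out_dim, in_dim), row-major
 *   bias:   (out_dim), or NULL for no bias
 *   out:    (rows, out_dim),   row-major */
void linear_forward(float *out, const float *inp, const float *weight,
                    const float *bias, size_t rows, size_t in_dim,
                    size_t out_dim) {
    assert(out && inp && weight);
    for (size_t r = 0; r < rows; r++) {
        for (size_t o = 0; o < out_dim; o++) {
            /* Start from the bias, then accumulate one dot product. */
            float acc = bias ? bias[o] : 0.0f;
            for (size_t i = 0; i < in_dim; i++) {
                acc += inp[r * in_dim + i] * weight[o * in_dim + i];
            }
            out[r * out_dim + o] = acc;
        }
    }
}
```

Everything a framework layer would hide is on the page here: the memory layout, the loop order, and the accumulator. A real kernel would tile and vectorize these loops, but the arithmetic is the same.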

llm.c also includes CUDA kernels, which zigllm does not. If your goal is to understand GPU-accelerated training of a small GPT, llm.c is unmatched.

What zigllm does differently

zigllm makes three deliberate choices that diverge from llm.c.

Inference only. zigllm does not implement training. There is no backward pass, no optimizer, no gradient checkpointing. This is a focus, not an oversight. The project’s claim is that inference is interesting enough — and structurally distinct enough — to deserve a whole codebase to itself. KV caching, quantization, streaming, grammar-constrained sampling: these only exist on the inference side, and they are where most of the interesting engineering of modern LLM serving lives.

Eighteen architecture families instead of one. llm.c is GPT-2. zigllm is GPT-2 and LLaMA and Mistral and Mamba and BERT and thirteen others. The comparative angle, seeing how GPT-2 differs from LLaMA and how LLaMA differs from Mistral, is a different educational structure from “read one model deeply.”

Zig, not C. This sounds like a small change, and at the kernel level it almost is: zigllm’s matmul kernels would be recognizable to a C programmer. But Zig gives you comptime, which means the tensor type can be generic over dtype without runtime dispatch. It gives you first-class SIMD vector types (@Vector), which means the SIMD kernel reads like a vectorized loop instead of a wall of _mm256_* calls. It gives you defer (and errdefer) for cleanup, which makes manual memory discipline less painful than in C. The kernels are not radically different; the reading experience is.
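As an illustration of the inference-only machinery mentioned under the first choice above, here is a minimal KV cache sketch. It is written in C, since kernel-level code of this kind reads much the same in either language. It is not zigllm's API: KVCache, kv_append, kv_scores, and the fixed sizes are hypothetical, and it covers a single attention head with the softmax and the 1/sqrt(d) scaling left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define KV_MAX_POS 8   /* hypothetical context limit for the sketch */
#define KV_HEAD_DIM 4  /* hypothetical head dimension */

typedef struct {
    float k[KV_MAX_POS][KV_HEAD_DIM]; /* cached keys, one row per position */
    float v[KV_MAX_POS][KV_HEAD_DIM]; /* cached values, one row per position */
    size_t len;                       /* number of positions filled so far */
} KVCache;

/* Store the key/value vectors of the newest token; returns its position. */
size_t kv_append(KVCache *c, const float *k, const float *v) {
    assert(c->len < KV_MAX_POS);
    memcpy(c->k[c->len], k, sizeof c->k[0]);
    memcpy(c->v[c->len], v, sizeof c->v[0]);
    return c->len++;
}

/* Raw attention scores q . k_t for every cached position t.
 * The caller provides room for c->len floats in scores. */
void kv_scores(const KVCache *c, const float *q, float *scores) {
    for (size_t t = 0; t < c->len; t++) {
        float dot = 0.0f;
        for (size_t d = 0; d < KV_HEAD_DIM; d++) {
            dot += q[d] * c->k[t][d];
        }
        scores[t] = dot;
    }
}
```

The point of the cache is that each decoding step computes keys and values only for the new token and reuses the stored rows for every earlier position. Training has no equivalent, because it processes whole sequences at once.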

Side by side

Aspect	llm.c	zigllm
Language	C (+ CUDA)	Zig
Scope	Training + inference, GPT-2	Inference, 18 architectures
Dependencies	None	None
Single-file?	Roughly	No — organized by layer
SIMD	Manual / via compiler	First-class intrinsics, explicit
Quantization	Not the focus	Central — 18+ formats
GPU	Yes (CUDA)	No, CPU only
Test count	Comparatively few	285+

Where they meet

Both projects make the same fundamental wager: that you learn more from reading a small, honest implementation than from reading the same algorithm wrapped in three layers of framework. Both are written in the kind of low-level language that forces you to think about memory. Both treat the matmul as something you implement, not something you call.

If you read llm.c and then read zigllm, you will have seen the same fundamental object — a transformer — from two adjacent languages and two adjacent vantage points. The training perspective in llm.c teaches you what the weights mean (the result of an optimization process). The inference perspective in zigllm teaches you what the weights cost (memory layout, kernel access patterns, quantization tricks).
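To give "what the weights cost" some shape, here is a minimal sketch of symmetric 8-bit block quantization in C. It is a generic illustration, not one of zigllm's actual formats: the BlockQ8 layout, the block size of 32, and the function names are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QBLOCK 32 /* values per block; an assumption for this sketch */

typedef struct {
    float scale;      /* one scale shared by the whole block */
    int8_t q[QBLOCK]; /* quantized values in [-127, 127] */
} BlockQ8;

/* Quantize QBLOCK floats so that x[i] is approximately scale * q[i]. */
void quantize_block(const float *x, BlockQ8 *out) {
    float amax = 0.0f; /* largest absolute value in the block */
    for (size_t i = 0; i < QBLOCK; i++) {
        float a = x[i] < 0.0f ? -x[i] : x[i];
        if (a > amax) amax = a;
    }
    out->scale = amax / 127.0f;
    float inv = amax > 0.0f ? 127.0f / amax : 0.0f;
    for (size_t i = 0; i < QBLOCK; i++) {
        float s = x[i] * inv;
        /* Round half away from zero without needing libm. */
        out->q[i] = (int8_t)(s >= 0.0f ? s + 0.5f : s - 0.5f);
    }
}

/* Recover an approximate float for element i of the block. */
float dequantize_at(const BlockQ8 *b, size_t i) {
    assert(i < QBLOCK);
    return b->scale * (float)b->q[i];
}
```

With this layout, 32 weights take one float plus 32 bytes instead of 32 floats. The price is a rounding error bounded by half a quantization step per value.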

The complementary reading order

If you are doing both:

1. llm.c first, to see training. Read train_gpt2.c end to end. Don’t be afraid of the C.
2. zigllm Layers 1–2, to convert your training-time tensor intuition into an inference-time one. The shapes are familiar; the kernels are not.
3. zigllm Layer 5, to expand your single-architecture intuition (GPT-2 in llm.c) into a comparative one. LLaMA, Mistral, Mamba: what changes, what doesn’t.
4. zigllm Layer 6, to see what inference looks like as a serious system: KV cache, sampler zoo, grammar constraints.
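As a taste of the samplers and grammar constraints in that last step, here is a minimal sketch of constrained greedy decoding in C. The assumption is that a grammar engine has already reduced the current state to a per-token allowed mask. The function name and signature are hypothetical, not zigllm's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns the index of the highest logit among allowed tokens,
 * or -1 if the mask permits no token at all. */
int constrained_argmax(const float *logits, const bool *allowed,
                       size_t vocab) {
    assert(logits && allowed);
    int best = -1;
    for (size_t i = 0; i < vocab; i++) {
        if (!allowed[i]) continue; /* the grammar forbids this token */
        if (best < 0 || logits[i] > logits[best]) best = (int)i;
    }
    return best;
}
```

A temperature or top-k sampler would replace the argmax with a draw from the masked distribution, but the masking step stays the same.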

llm.c is the “I trained a GPT-2 from scratch” story. zigllm is the “I ran an LLM with no hidden machinery” story. The intersection is the matmul. The differences are the parts of the LLM lifecycle each project chose to make legible.

Final note

llm.c and zigllm are both expressions of the same philosophy: low-level languages used to make machine learning legible. They differ on language, on lifecycle, and on scope. If you only read one, read the one that matches the half of the lifecycle you care about. If you read both, read them in the order above — and you’ll find that the two implementations explain each other.
