What you actually learn by writing the matmul yourself
Matrix multiplication is the heart of a transformer. Writing it by hand in Zig — strides, SIMD, K-quantization — is where 'I know how this works' stops being a claim and starts being legible.
There is a moment, when you read about transformers for the first time, where someone draws a box labelled Linear and says “this is just a matrix multiplication.” You nod. You move on. The arrow into the box becomes a tensor. The arrow out becomes another tensor. The shape of the box itself is a hyperparameter on a slide.
This is a perfectly fine pedagogical move. It is also where most readers stop learning. Because the Linear box is not just a matrix multiplication. It is a stride pattern, a memory layout, a SIMD lane width, an unrolling decision, a quantization scheme, and — if you are unlucky — three layers of framework indirection. Almost all of the actual cost of running a model is in that box. And the box, drawn flat on a slide, hides every cost.
So when zigllm asks you to write the matmul yourself, it is not asking you to perform a rite of passage. It is asking you to make the cost visible.
You learn that a tensor is a stride
The first thing you learn is that a tensor is not really a multi-dimensional array. It is a flat buffer of numbers, plus a shape, plus a set of strides, one per dimension. The strides tell you how to convert [i, j] into a flat index: i * stride[0] + j * stride[1]. Two tensors with the same shape can have very different strides, which is what a transpose actually is. Transposing a matrix does not have to move any data. It swaps the strides.
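A minimal sketch of that idea, written in Python to keep it short. The View class and its method names are invented for illustration and are not zigllm's API; strides are counted in elements, not bytes.

```python
from dataclasses import dataclass


@dataclass
class View:
    """A 2-D tensor view: a shared flat buffer plus shape and strides."""
    data: list
    shape: tuple
    strides: tuple  # in elements, one entry per dimension

    def at(self, i, j):
        # Flat index = i * stride[0] + j * stride[1]
        return self.data[i * self.strides[0] + j * self.strides[1]]

    def transpose(self):
        # No data moves: swap shape and strides, keep the same buffer.
        return View(self.data, self.shape[::-1], self.strides[::-1])


# A 2x3 row-major matrix: [[1, 2, 3], [4, 5, 6]]
buf = [1, 2, 3, 4, 5, 6]
a = View(buf, (2, 3), (3, 1))
a_t = a.transpose()  # shape (3, 2), strides (1, 3), same buffer
```

Here a.at(0, 2) and a_t.at(2, 0) read the same slot of buf. Only the arithmetic that finds it differs.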
This shows up in the code as soon as you try to multiply A @ B and A.T @ B and see that the second version, naively, is far slower. The data is the same. The traversal pattern is not: the transposed operand is read with a large stride, so consecutive loads land on different cache lines. Slides do not teach you this. A textbook can hint at it. The matmul kernel forces you to know it.
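One way to see the difference without a profiler is to count cache-line changes in a toy model. The sketch below assumes 64-byte lines holding sixteen f32 values and ignores cache capacity and prefetching, so it illustrates the access pattern rather than predicting timings. The function name is made up for this example.

```python
LINE_ELEMS = 16  # assumption: 64-byte cache lines, 4-byte f32 elements


def line_switches(n_rows, n_cols, by_rows):
    """Count how often consecutive accesses move to a different cache line
    while walking a row-major n_rows x n_cols matrix."""
    switches = 0
    prev = None
    outer, inner = (n_rows, n_cols) if by_rows else (n_cols, n_rows)
    for o in range(outer):
        for t in range(inner):
            i, j = (o, t) if by_rows else (t, o)
            line = (i * n_cols + j) // LINE_ELEMS
            if line != prev:
                switches += 1
                prev = line
    return switches


row_walk = line_switches(64, 64, by_rows=True)   # stride-1 traversal
col_walk = line_switches(64, 64, by_rows=False)  # stride-64 traversal
```

For a 64×64 matrix the row walk changes lines 256 times and the column walk 4096 times: the same data, with sixteen times as many line changes in this model.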
You learn that SIMD is just structured laziness
zigllm’s linear algebra layer uses Zig’s first-class SIMD vector types (built with @Vector). The kernel takes a row of A and a column of B, multiplies them lane by lane, and sums the result. With AVX2 that is eight f32 lanes at a time. With AVX-512 it is sixteen. With neither, you fall back to scalar code.
The lesson you take away is structural, not numerical. SIMD is not magic. It is a way of saying: “I am about to do the same thing eight times in a row with no branches between them. Stop pretending you don’t know that.” The kernel reads like an honest description of the work. The 3-5× speedup over a scalar baseline is not a clever trick. It is what happens when you stop hiding structure from the CPU.
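The shape of such a kernel can be sketched in Python, with a list of eight accumulators standing in for one vector register. This is an illustration of the structure, not zigllm's code: LANES and dot_lanes are names chosen here, and a real kernel would replace the inner loop with a single vector multiply-add.

```python
LANES = 8  # assumption: eight f32 lanes, as in a 256-bit AVX2 register


def dot_lanes(a, b):
    """Dot product computed LANES elements at a time, with a scalar tail."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    acc = [0.0] * LANES                 # one partial sum per lane
    full = len(a) - len(a) % LANES      # part that fills whole vectors
    for base in range(0, full, LANES):
        for lane in range(LANES):       # a single vector op in real SIMD
            acc[lane] += a[base + lane] * b[base + lane]
    total = sum(acc)                    # horizontal reduction across lanes
    for k in range(full, len(a)):       # leftover elements, done scalar
        total += a[k] * b[k]
    return total


xs = [float(i) for i in range(19)]
ys = [2.0] * 19
result = dot_lanes(xs, ys)
```

Because the partial sums are accumulated per lane, the floating-point summation order differs from a plain scalar loop, so the two can disagree in the last bits on general inputs.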
You learn that quantization is a memory argument
The README claims 18+ quantization formats, with up to 95% memory reduction. The number is striking, but the lesson is structural again. Quantization is not “we made the weights smaller.” It is “we changed how the weights are addressed.”
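The arithmetic behind such percentages is easy to check. The sketch below assumes a hypothetical block layout (32 weights, 4-bit codes, one 2-byte f16 scale) and compares it with f32 storage; the function names are illustrative.

```python
def bits_per_weight(weights_per_block, code_bits, overhead_bytes):
    """Average storage per weight, including per-block metadata such as scales."""
    total_bits = weights_per_block * code_bits + overhead_bytes * 8
    return total_bits / weights_per_block


def reduction_vs_f32(bpw):
    """Fraction of memory saved relative to 32-bit floats."""
    return 1 - bpw / 32


bpw = bits_per_weight(32, 4, 2)  # 4.5 bits per weight
saved = reduction_vs_f32(bpw)    # 0.859375, about 86%
```

By the same arithmetic, a 95% reduction relative to f32 leaves 1.6 bits per weight, so a figure like that can only apply to the lowest-bit formats.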
A K-quantized block in GGUF is a tiny structure: a scale, sometimes a minimum that plays the role of a zero-point, and a packed run of low-bit values. K-quants group these into super-blocks whose per-sub-block scales are themselves quantized. To run a matmul against quantized weights, the kernel does not first “decompress” the whole weight matrix. That would defeat the purpose. Instead, the dot-product kernel itself learns to read the packed format directly. The 4-bit values are unpacked into SIMD lanes one block at a time. The zero-point is subtracted on the fly. The scale is multiplied once per block, at the end.
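A minimal sketch of a block dot product that reads packed 4-bit codes directly. The layout here is deliberately simplified and hypothetical (32 weights, one float scale, a fixed zero-point of 8, two codes per byte); GGUF's actual K-quant layouts are more involved, and none of these function names come from zigllm.

```python
BLOCK = 32       # hypothetical block size
ZERO_POINT = 8   # fixed zero-point: code 8 represents 0.0


def quantize_block(w):
    """Pack 32 floats into (scale, 16 bytes holding two 4-bit codes each)."""
    if len(w) != BLOCK:
        raise ValueError("expected exactly one block of weights")
    amax = max(abs(v) for v in w)
    scale = amax / 7 if amax > 0 else 1.0
    codes = [min(15, max(0, round(v / scale) + ZERO_POINT)) for v in w]
    # Even-indexed code goes in the low nibble, odd-indexed in the high nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return scale, packed


def dot_q4(block, x):
    """Dot product of one packed block with 32 float activations."""
    scale, packed = block
    acc = 0.0
    for i, byte in enumerate(packed):
        # Unpack two codes and subtract the zero-point on the fly.
        acc += ((byte & 0x0F) - ZERO_POINT) * x[2 * i]
        acc += ((byte >> 4) - ZERO_POINT) * x[2 * i + 1]
    return scale * acc  # the scale is applied once, at the end of the block


def dequantize(block):
    """Reference path: expand the block back to floats (for checking only)."""
    scale, packed = block
    out = []
    for byte in packed:
        out.append(scale * ((byte & 0x0F) - ZERO_POINT))
        out.append(scale * ((byte >> 4) - ZERO_POINT))
    return out
```

dequantize exists only so the packed path can be checked against the float path. The kernel itself never builds the float weights.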
Reading the zigllm quantization kernels alongside an arithmetic-heavy diagram of the same operation is the cleanest way I have seen to internalize what quantization actually buys you. It is not “less math.” It is the same math against a smaller, cleverer representation of the weights. The kernel is where the cleverness lives.
You learn what a test can prove
The project ships 285+ tests, and the linear algebra layer is where they earn their keep. Every kernel — naive, SIMD, quantized — is checked against a reference implementation on small inputs with known answers. When you change the SIMD path, you find out within a second whether you broke a sign or a shape. When you add a new quantization format, the test you write to prove the dequant-and-multiply path matches the float path is the specification of that format.
This is the part of the project where “executable documentation” stops being a slogan. You can read the math in a textbook. You can read the kernel in src/linear_algebra/. You can read the test that ties one to the other. And you can change any of the three and watch the others either agree or scream.
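As a sketch of what such a test looks like, here is a reference-versus-kernel check in Python. The kernel under test is a loop-reordered matmul standing in for a SIMD or quantized path; every name here is illustrative rather than taken from the zigllm test suite.

```python
def matmul_ref(a, b):
    """Reference: the textbook triple loop over nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_ikj(a, b):
    """Kernel under test: same math, with the j loop innermost."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                c[i][j] += a_ip * b[p][j]
    return c


def max_abs_diff(x, y):
    """Largest elementwise difference between two equally shaped matrices."""
    return max(abs(u - v) for rx, ry in zip(x, y) for u, v in zip(rx, ry))


def check_kernel(kernel, tol=1e-9):
    """Known-answer test plus agreement with the reference on odd shapes."""
    # Known answer: a 2x2 product small enough to verify by hand.
    assert kernel([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19, 22], [43, 50]]
    # Shapes that are not multiples of any lane width or tile size.
    for m, k, n in [(1, 1, 1), (3, 5, 2), (7, 3, 9)]:
        a = [[(i * k + p) % 7 - 3 for p in range(k)] for i in range(m)]
        b = [[(p * n + j) % 5 - 2 for j in range(n)] for p in range(k)]
        assert max_abs_diff(kernel(a, b), matmul_ref(a, b)) <= tol
    return True
```

Calling check_kernel(matmul_ikj) passes. A kernel that swaps its operands or drops a sign fails on the hand-checked 2x2 case before it ever reaches the larger shapes.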
You learn that the cache is a coauthor
There is a final lesson that creeps in once you have read the kernel a few times: most of the work in a matmul is not arithmetic, it is waiting. Memory is slow. A modern core can start a new fused multiply-add every cycle, but loading an operand from main memory takes hundreds of cycles. Every clever matmul kernel ever written is, fundamentally, a strategy for keeping the multiplier fed.
Tiling is the simplest such strategy. You break the matrices into small blocks that fit in L1 cache, multiply those blocks, accumulate. The arithmetic is the same. The order is different. The order is the whole performance story.
You cannot see this in a slide, and you cannot see it in a textbook formula. You can see it when you write a naive matmul, time it, write a tiled matmul, time it, and watch a 4× gap appear out of an identical flop count. That gap is the cache. The kernel is where you make peace with it.
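A tiled matmul is short enough to sketch in full. This version is in Python over flat row-major lists, so it shows the loop structure only; the cache effect itself will not appear in an interpreter. The default tile size of 4 is an arbitrary placeholder (a real kernel would pick it so the working tiles fit in L1), and matmul_tiled is a name chosen for this example.

```python
def matmul_tiled(a, b, m, k, n, tile=4):
    """C = A @ B for row-major flat lists; A is m x k, B is k x n."""
    c = [0.0] * (m * n)
    for i0 in range(0, m, tile):
        for k0 in range(0, k, tile):
            for j0 in range(0, n, tile):
                # Multiply one tile of A by one tile of B, accumulating into C.
                for i in range(i0, min(i0 + tile, m)):
                    for p in range(k0, min(k0 + tile, k)):
                        a_ip = a[i * k + p]
                        for j in range(j0, min(j0 + tile, n)):
                            c[i * n + j] += a_ip * b[p * n + j]
    return c


# [[1, 2, 3], [4, 5, 6]] @ [[7, 8], [9, 10], [11, 12]]
a = [1, 2, 3, 4, 5, 6]
b = [7, 8, 9, 10, 11, 12]
c = matmul_tiled(a, b, 2, 3, 2, tile=2)
```

The multiply-adds performed are exactly those of the naive triple loop. Only the order in which they touch memory changes.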
What you take away
You take away the realization that the Linear box is almost the entire performance story of a transformer. Everything else, from the attention sums to the layer norms to the activations, is overhead bookkeeping compared to the dense matmuls that dominate the flop budget. Optimizing those kernels is the work. Reading those kernels is the education.
After Layer 2 you can look at any blog post about “we made inference 1.4× faster” and predict which knob they probably turned. You start to read inference papers structurally instead of mystically. You stop nodding at the slide and start asking what shape, what stride, what representation. You stop being surprised when someone claims a 3× speedup from changing nothing but the loop order, because you now know that changing the loop order is changing what the cache sees.
That, more than any one optimization, is what writing the matmul by hand gives you. It gives you a vocabulary for talking about performance that is grounded in something more durable than benchmark numbers — the actual cost model of the machine you are running on.
— Read alongside: Layer 2 — Linear Algebra.