Preface · An educational project

Learn how LLMs work by building one in Zig.

zigllm is a book-shaped Zig codebase. It implements transformer architectures across six progressive layers — starting at raw tensors and ending at sampled text. Every component is documented to teach why it works, not just how.

Start the curriculum → Read the code on GitHub ↗

Zig 0.14+ · MIT · inference-only · 285+ tests as executable documentation

Listing 1 · a numerically-stable softmax

// Numerically-stable softmax over a row of attention scores, computed in place.
pub fn softmax(scores: []f32) void {
    if (scores.len == 0) return; // an empty row has nothing to normalize
    var max: f32 = scores[0];
    for (scores) |s| max = @max(max, s);

    var sum: f32 = 0.0;
    for (scores) |*s| {
        s.* = @exp(s.* - max); // shift for stability
        sum += s.*;
    }
    for (scores) |*s| s.* /= sum;
}
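
The shift in Listing 1 does not change the result, because the common factor cancels between numerator and denominator. A short derivation, writing m for the row maximum (a symbol introduced here for illustration):

```latex
\mathrm{softmax}(x)_i
  = \frac{e^{x_i}}{\sum_j e^{x_j}}
  = \frac{e^{-m}\, e^{x_i}}{e^{-m} \sum_j e^{x_j}}
  = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}},
\qquad m = \max_j x_j .
```

Every exponent x_i - m is at most zero, so each term lies in (0, 1] and the denominator is at least 1 (the maximum element contributes exactly e^0 = 1). Without the shift, an f32 exponential overflows to infinity once its argument exceeds roughly 88.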

Definition

What is zigllm?

zigllm is an educational, inference-only implementation of large language models, written from scratch in Zig. It is organised like a textbook: six progressive layers take you from tensors and SIMD matrix math up through attention blocks, 18 transformer architecture families, and a full generation loop. It is designed for reading and learning — not as a production runtime — and its 285+ tests double as executable documentation.

Why build one?

Frameworks hide the math. This one shows it.

Modern LLM stacks are miles of optimized kernels behind a one-line API. You can call generate() for years and never see a matmul, a RoPE rotation, or a KV-cache write. The abstraction is the point — until you actually want to understand what the model is doing.

The problem

A forward pass is a black box. Attention, normalization, sampling, quantization — each is a word you can define but can't see executing.

The approach

Read and write each layer bottom-up in a language with no runtime and no hidden machinery. When a chapter feels hand-wavy, open the matching test: the tests are the proof.

Table of contents

Six layers, bottom up.

The project is organized like a textbook. Start at Layer 1 — tensors — and build the primitives you need for the next chapter. By Layer 6 you have a generation loop.

See the full curriculum →

Progressive layers: 6
Architecture families: 18
Tests as documentation: 285+
One toolchain: Zig 0.14+
Licensed code: MIT
No GPU required: CPU

Chapter highlights

What's inside the book.

§5

18 architecture families

LLaMA, Mistral, GPT-2, Falcon, Qwen, Phi, GPT-J, GPT-NeoX, BLOOM, Mamba, BERT, Gemma, StarCoder, MoE, multimodal, and more — implemented as a study of how each family differs from the others.

All architectures →

§6

Sampling, in full

Greedy, top-k, top-p, temperature, Mirostat, typical, tail-free, contrastive. Grammar-constrained generation: JSON, regex, CFG. The decoding loop stops being a black box.

Read the inference layer →
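
To make one of the strategies above concrete, here is a minimal sketch of top-p (nucleus) sampling. It is an illustration under stated assumptions, not zigllm's actual sampler: the names `sampleTopP` and `byProbDesc` are hypothetical, the input is assumed to be an already-normalized probability row (temperature would be applied to the logits before softmax), and the uniform draw `u` is passed in so the function stays deterministic and easy to test.

```zig
const std = @import("std");

/// Sort comparator: orders token indices by descending probability.
fn byProbDesc(probs: []const f32, a: usize, b: usize) bool {
    return probs[a] > probs[b];
}

/// Hypothetical top-p (nucleus) sampler, not zigllm's API.
/// `probs` is a normalized distribution, `scratch` is caller-provided
/// index storage of the same length, and `u` is a uniform draw in [0, 1).
fn sampleTopP(probs: []const f32, scratch: []usize, top_p: f32, u: f32) usize {
    std.debug.assert(probs.len > 0 and scratch.len == probs.len);

    // Rank token ids from most to least probable.
    for (scratch, 0..) |*idx, i| idx.* = i;
    std.mem.sort(usize, scratch, probs, byProbDesc);

    // Grow the nucleus until it covers at least `top_p` of the mass.
    var mass: f32 = 0.0;
    var n: usize = 0;
    while (n < scratch.len) {
        mass += probs[scratch[n]];
        n += 1;
        if (mass >= top_p) break;
    }

    // Pick inside the nucleus by walking its cumulative distribution.
    const target = u * mass;
    var cum: f32 = 0.0;
    for (scratch[0..n]) |idx| {
        cum += probs[idx];
        if (target < cum) return idx;
    }
    return scratch[n - 1]; // guard against rounding at the boundary
}

test "low-probability tail is never sampled" {
    const probs = [_]f32{ 0.1, 0.6, 0.3 };
    var scratch: [3]usize = undefined;
    // With top_p = 0.8 the nucleus is {1, 2}; token 0 is cut off.
    try std.testing.expect(sampleTopP(&probs, &scratch, 0.8, 0.99) != 0);
}
```

Taking the scratch buffer from the caller keeps the sampler allocation-free, in line with the explicit memory management the project emphasizes. In a real loop, `u` would come from a random number generator.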

§5

GGUF, end to end

Memory-mapped GGUF loading compatible with the llama.cpp ecosystem. Read a pre-trained model from disk, decode the tensors yourself, and run a forward pass.

Run your first model →
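
As a taste of what "decode the tensors yourself" starts with, the sketch below reads the fixed-size fields at the front of a GGUF file: the four-byte magic, a 32-bit version, and 64-bit tensor and metadata counts. It assumes the little-endian v2+ layout and operates on a byte slice such as a memory-mapped region. `GgufHeader` and `parseHeader` are hypothetical names for illustration, not zigllm's loader.

```zig
const std = @import("std");

/// Fixed-size fields at the start of a GGUF file (v2 and later layout).
const GgufHeader = struct {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
};

const ParseError = error{ TooShort, BadMagic };

/// Hypothetical helper, not zigllm's loader: reads the header from the
/// first 24 bytes of a file image, assuming little-endian encoding.
fn parseHeader(bytes: []const u8) ParseError!GgufHeader {
    if (bytes.len < 24) return error.TooShort;
    if (!std.mem.eql(u8, bytes[0..4], "GGUF")) return error.BadMagic;
    return .{
        .version = std.mem.readInt(u32, bytes[4..8], .little),
        .tensor_count = std.mem.readInt(u64, bytes[8..16], .little),
        .metadata_kv_count = std.mem.readInt(u64, bytes[16..24], .little),
    };
}

test "rejects a buffer that is not GGUF" {
    const junk = [_]u8{0} ** 24;
    try std.testing.expectError(error.BadMagic, parseHeader(&junk));
}
```

The metadata key-value pairs and tensor descriptors follow these 24 bytes; parsing them is beyond this sketch.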

Why Zig

A language that gets out of the way.

Manual memory management makes the cost of every allocation legible. comptime generics let a single tensor type specialize to f32, f16, or quantized formats without runtime dispatch. First-class SIMD makes matmul kernels readable instead of hidden behind a framework.

The project is also evidence that Zig is viable for ML/AI workloads: no runtime, no garbage collector, and no Python on the hot path.
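
The comptime and SIMD claims above can be sketched in a few lines. The function below is a hypothetical `dot` helper, not taken from zigllm: the element type and vector width are compile-time parameters, so each instantiation compiles to its own specialized loop with no runtime dispatch, and the inner loop uses Zig's built-in `@Vector` type.

```zig
const std = @import("std");

/// Hypothetical helper (not zigllm's API): dot product specialized at
/// compile time for element type T and SIMD width `lanes`.
fn dot(comptime T: type, comptime lanes: usize, a: []const T, b: []const T) T {
    std.debug.assert(a.len == b.len);
    const Vec = @Vector(lanes, T);

    var acc: Vec = @splat(@as(T, 0));
    var i: usize = 0;
    // Main loop: multiply-accumulate `lanes` elements per iteration.
    while (i + lanes <= a.len) : (i += lanes) {
        const va: Vec = a[i..][0..lanes].*;
        const vb: Vec = b[i..][0..lanes].*;
        acc += va * vb;
    }

    // Horizontal sum of the vector accumulator, then a scalar tail.
    var sum: T = @reduce(.Add, acc);
    while (i < a.len) : (i += 1) sum += a[i] * b[i];
    return sum;
}

test "vector body plus scalar tail" {
    const a = [_]f32{ 1, 2, 3, 4, 5, 6 };
    const b = [_]f32{ 0.5, 0.5, 0.5, 0.5, 0.5, 0.5 };
    try std.testing.expectEqual(@as(f32, 10.5), dot(f32, 4, &a, &b));
}
```

A matmul kernel is this loop repeated over rows and columns. Calling `dot(f16, 8, ...)` would produce a separate half-precision specialization from the same source.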

Listing 2 · the autoregressive generation loop

// Layer 6: sample one token, feed it back, repeat.
var pos: usize = 0;
while (pos < max_tokens) : (pos += 1) {
    const logits = try model.forward(token, &kv_cache);
    const next = sampler.sample(logits); // top-p, temperature…
    if (next == tokenizer.eos) break;
    try out.writeAll(tokenizer.decode(next));
    token = next;
}

Notes from the margins

Recent essays

All posts →

§6 · Inference · 2026-04-02

Open chapter one.

Clone the repo, run the tests, and follow the doc tree. The fastest path to understanding is the one you walk in code.

Getting started → GitHub ↗ More from Cognisoc ↗

Recent essays

How a sampling loop becomes 'generation'

Tensors as the lingua franca

What you actually learn writing the matmul yourself
