§6 · Inference · 2026-04-02

How a sampling loop becomes 'generation'

Generation is not a single decision. It's a loop, a probability distribution, and a choice about how greedy or how wild to be. zigllm's Layer 6 makes the loop visible — and shows you why temperature, top-k, and top-p exist.

When people say “the model generated this,” they are gesturing at something that sounds mysterious but is, mechanically, very simple. A transformer is a function that takes a sequence of tokens and returns a probability distribution over what the next token might be. To generate, you sample from that distribution, append the sample to your sequence, and run the function again.

That is generation. That is the entire shape of it. zigllm’s Layer 6 makes this concrete in a way that, I think, no slide about “autoregressive language modeling” ever quite does.

The loop

The structure of the generation loop, stripped to its bones, is roughly:

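done = False
while not done:
    logits = model.forward(tokens, kv_cache)   # forward pass: scores for the next token
    next_token = sample(logits, config)        # sampler: turn the scores into one choice
    tokens.append(next_token)                  # bookkeeping: extend the sequence
    if next_token == EOS or len(tokens) >= max_len:   # stop condition
        done = True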

Four steps. One of them is the forward pass. One of them is the sampler. One of them is bookkeeping. One of them is the stop condition.

In zigllm, the forward pass is the entire stack you built in Layers 1–5. The sampler is the function sample(logits, config). The interesting thing about Layer 6 is that almost everything you might want to vary about “how” a model generates lives inside that one function.
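To make the loop concrete, here is a minimal runnable sketch in Python. It is not zigllm's code: `toy_forward`, the six-token vocabulary, and the greedy `sample` are stand-ins invented for illustration, and the KV cache is left out.

```python
# Toy stand-ins, invented for illustration; none of this is zigllm's API.
VOCAB = ["<eos>", "the", "cat", "sat", "on", "mat"]
EOS = 0

def toy_forward(tokens):
    """Pretend model: return one logit per vocabulary entry.

    It strongly prefers the entry after the last token, wrapping to <eos>.
    """
    preferred = (tokens[-1] + 1) % len(VOCAB)
    return [5.0 if i == preferred else 0.0 for i in range(len(VOCAB))]

def sample(logits):
    """Greedy sampler: index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])

def generate(prompt, max_len=16):
    tokens = list(prompt)
    while True:
        logits = toy_forward(tokens)      # forward pass
        next_token = sample(logits)       # sampler
        tokens.append(next_token)         # bookkeeping
        if next_token == EOS or len(tokens) >= max_len:  # stop condition
            return tokens

out = generate([1])
print(" ".join(VOCAB[t] for t in out))  # the cat sat on mat <eos>
```

Swapping `sample` for a different function changes how the text is produced without touching `toy_forward`, which is the point the rest of this essay builds on.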

What’s inside `sample`

The simplest sampler is greedy. It picks the token with the highest logit. It is deterministic. It is also, in practice, a terrible default for open-ended text: it tends to produce output that repeats itself in tight loops and never surprises you. There is a reason so few production systems default to greedy.

The next simplest is temperature scaling. Divide every logit by T before softmaxing. If T is less than 1, the distribution gets sharper — high-probability tokens get even more probability. If T is greater than 1, the distribution flattens — lower-probability tokens get a fairer shot. Temperature is the “how weird are you willing to be” knob.
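A small sketch of temperature scaling, assuming nothing beyond the description above. The function name and the three example logits are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide every logit by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

For these logits the top token's probability falls from roughly 0.87 at T = 0.5 to roughly 0.51 at T = 2, while the ranking of the tokens never changes. Temperature reshapes the distribution; it does not reorder it.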

Top-k truncates: keep only the k highest-probability tokens, renormalize, sample from those. Top-p (nucleus) is the same idea but adaptive: keep the smallest set of tokens whose cumulative probability exceeds p, renormalize, sample. The point of both is to chop off the long tail of low-probability tokens, because almost all of the bad surprises in generation come from that tail.
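Both truncations fit in a few lines. The sketch below is an illustration, not zigllm's implementation; the helper names and the two example distributions are assumptions, and the nucleus stops as soon as the cumulative mass reaches p.

```python
import random

def _renormalize(probs, keep):
    """Zero everything outside `keep` and rescale the rest to sum to 1."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalize(probs, order[:k])

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return _renormalize(probs, keep)

def sample_from(probs, rng):
    """Draw one index according to the (already filtered) probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def support_size(probs):
    return sum(1 for q in probs if q > 0)

peaked = [0.90, 0.05, 0.02, 0.01, 0.01, 0.01]
flat = [0.20, 0.18, 0.17, 0.16, 0.15, 0.14]
rng = random.Random(0)
for name, dist in (("peaked", peaked), ("flat", flat)):
    k_kept = support_size(top_k_filter(dist, 3))
    p_kept = support_size(top_p_filter(dist, 0.8))
    print(name, "top-k keeps", k_kept, "| top-p keeps", p_kept,
          "| sampled", sample_from(top_p_filter(dist, 0.8), rng))
```

With k = 3, top-k keeps three candidates for both distributions. With p = 0.8, the nucleus keeps one token for the peaked distribution and five for the flat one.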

zigllm implements all of these — greedy, top-k, top-p, temperature, plus Mirostat, typical sampling, tail-free sampling, and contrastive decoding. Reading them next to each other is clarifying. They are all the same shape: take a vector of logits, optionally adjust it, optionally truncate it, sample one index.

Why so many samplers?

Because every one of them is a different theory of what “the long tail” means.

Top-k says: “Keep a fixed number of candidates, regardless of how peaked the distribution is.” This is bad when the distribution is very peaked (with k = 10, say, you keep nine tokens that have basically zero probability) and bad when the distribution is very flat (you cut off plausible candidates).

Top-p says: “Keep a probabilistically meaningful slice of the distribution, however many tokens that takes.” This adapts to peaked and flat distributions. It is the most common default in modern systems for good reason.

Mirostat says: “Forget the geometric shape of the distribution. Just keep the perplexity of the output near a target.” It is a feedback controller. You tell it how surprising you want the text to feel, on average, and it adjusts its truncation to maintain that.
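The feedback idea can be sketched as follows. This is a simplified step in the style of Mirostat 2.0, not zigllm's implementation: the parameter names `tau` (target surprise in bits), `eta` (learning rate), and `mu` (current surprise cap) follow common usage, and measuring the observed surprise on the renormalized distribution is a simplifying assumption.

```python
import math
import random

def mirostat_step(probs, mu, tau, eta, rng):
    """One simplified Mirostat-2.0-style step.

    Returns (sampled_index, updated_mu).
    """
    # Truncate: drop tokens whose surprise -log2(p) exceeds the cap mu.
    keep = [i for i, p in enumerate(probs) if p > 0 and -math.log2(p) <= mu]
    if not keep:  # always keep at least the most likely token
        keep = [max(range(len(probs)), key=lambda i: probs[i])]
    total = sum(probs[i] for i in keep)
    weights = [probs[i] / total for i in keep]
    # Sample from the renormalized survivors.
    choice = rng.choices(keep, weights=weights, k=1)[0]
    # Feedback: compare the observed surprise with the target, move the cap.
    observed = -math.log2(probs[choice] / total)
    mu = mu - eta * (observed - tau)
    return choice, mu

probs = [0.4, 0.2, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]
tau, eta = 1.5, 0.1
mu = 2 * tau                      # a common starting point for the cap
rng = random.Random(0)
for _ in range(200):
    _, mu = mirostat_step(probs, mu, tau, eta, rng)
print("surprise cap after 200 steps:", round(mu, 3))
```

When the sampled tokens are more surprising than `tau`, the cap shrinks and the next truncation gets tighter; when they are less surprising, the cap grows. A target surprise of `tau` bits corresponds to a perplexity of 2 to the power `tau`.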

Reading these implementations next to each other — they all live in src/inference/ — turns “which sampler should I use” from a folk-wisdom question into a structural one. You can see exactly what each one optimizes for.

The grammar-constrained corner

zigllm also implements grammar-constrained generation: JSON, regex, and full context-free grammars. This is the part of Layer 6 that is, frankly, the most surprising on first read.

The mechanism is this: instead of sampling from every token in the vocabulary, you mask out the logits of any token that would, if appended to the current output, violate the grammar. The masked tokens get their logit set to -inf before softmax, so their probability comes out as exactly zero. The sampler never even sees them as possibilities.

What this means in practice is that you can force a model to emit valid JSON without fine-tuning it. The sampler refuses to consider tokens that would break the grammar. The model itself does not know — from its perspective, the universe of plausible next tokens just happens to align with the grammar at every step.

It is the cleanest demonstration I know of that generation is not a property of the model. It is a property of the loop. You can change the way text is generated without retraining anything. You just change the rules of the sampler.
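The masking step can be shown with a toy grammar. Everything below is hypothetical: the seven-token vocabulary, the hand-written `allowed_next` rule for lists such as `[1,2]`, and the fixed logits standing in for a model that badly wants to say "hello". A real implementation would derive the allowed set from a JSON, regex, or context-free grammar.

```python
import math

VOCAB = ["[", "]", ",", "1", "2", "hello", "<eos>"]

def allowed_next(output):
    """Toy grammar: a non-empty list of digits such as [1,2], then <eos>."""
    if not output:
        return {"["}
    last = output[-1]
    if last in ("[", ","):
        return {"1", "2"}
    if last in ("1", "2"):
        return {",", "]"}
    if last == "]":
        return {"<eos>"}
    return set()

def mask_logits(logits, output):
    """Set the logit of every grammar-violating token to -inf."""
    ok = allowed_next(output)
    return [x if VOCAB[i] in ok else -math.inf for i, x in enumerate(logits)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Fixed logits standing in for a model that always prefers "hello".
MODEL_LOGITS = [0.0, 1.0, 0.5, 0.2, 0.1, 10.0, 0.0]

output = []
while not output or output[-1] != "<eos>":
    probs = softmax(mask_logits(MODEL_LOGITS, output))
    best = max(range(len(probs)), key=lambda i: probs[i])  # greedy, for brevity
    output.append(VOCAB[best])
print("".join(output))  # [1]<eos>
```

The logits never change, and "hello" stays the unmasked favorite at every step, yet the output is a valid list. Only the sampler's rules were changed.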

KV caching, briefly

The other thing Layer 6 makes inescapable is the KV cache. On every step of the loop, the model needs to attend to every previous token. Naively, this means redoing all the attention work for the entire prefix every time you generate a new token. With KV caching, you store the keys and values you have already computed and only compute the ones for the new token. The attention work per step drops from quadratic in the prefix length to linear, which is behind the 20× speedup the README quotes.
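A toy single-head example shows both halves of the claim: the cached path gives the same attention output, and it does far fewer key/value projections. This is a sketch under loud assumptions: `project` is a scalar multiply standing in for a weight matrix, the embeddings are invented, and the counters track only key/value projections, not total runtime.

```python
import math

def project(x, w):
    """Toy projection: scale each component (stand-in for a weight matrix)."""
    return [w * xi for xi in x]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for j, vj in enumerate(v):
            out[j] += (w / total) * vj
    return out

def step_without_cache(prefix, counter):
    # Recompute keys and values for the whole prefix on every step.
    keys = [project(x, 0.5) for x in prefix]
    values = [project(x, 2.0) for x in prefix]
    counter[0] += 2 * len(prefix)
    return attend(project(prefix[-1], 1.5), keys, values)

def step_with_cache(new_embedding, cache, counter):
    # Compute the key and value for the new token only; reuse the rest.
    cache["k"].append(project(new_embedding, 0.5))
    cache["v"].append(project(new_embedding, 2.0))
    counter[0] += 2
    return attend(project(new_embedding, 1.5), cache["k"], cache["v"])

embeddings = [[float(i), 1.0] for i in range(1, 9)]
naive_count, cached_count = [0], [0]
cache = {"k": [], "v": []}
for t in range(1, len(embeddings) + 1):
    a = step_without_cache(embeddings[:t], naive_count)
    b = step_with_cache(embeddings[t - 1], cache, cached_count)
    assert all(math.isclose(x, y) for x, y in zip(a, b))
print(naive_count[0], cached_count[0])  # 72 16
```

Over eight steps the naive path performs 72 projections and the cached path 16, and the gap widens as the sequence grows. The price is memory: the cache holds one key and one value per token.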

The cache is also where most of the engineering complexity of production inference lives: sliding windows, paged attention, prefix sharing across requests. zigllm doesn’t implement all of those, but Layer 6 has the basic cache, and reading it gives you the right mental model for why the more advanced techniques exist.

Streaming, and why it matters for understanding

One more thing Layer 6 implements that I think is underrated, pedagogically, is streaming. A streaming generator does not return the full output string at the end. It returns tokens (or characters) as they are produced.

That sounds like a UX detail, and at the surface it is one. But reading the streaming implementation in zigllm clarifies something structural: generation is truly incremental. Token N+1 is decided before token N+2 even has a representation. There is no global plan. There is no “the model wrote this sentence.” There is a loop that produces one token at a time, conditioned on everything that came before.

This is the closest you can get to seeing autoregression with your own eyes. The streaming loop is the autoregressive structure made interactive. When the cursor blinks token by token, that is not animation. That is the loop, running.
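In Python, a generator function expresses the same idea directly. The sketch below is illustrative only: `stream_tokens`, `toy_forward`, and `greedy` are hypothetical names, and the toy model simply counts down to an end-of-sequence token of 0.

```python
def stream_tokens(prompt, forward, sample, eos, max_len=32):
    """Yield each new token as soon as it is chosen."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        next_token = sample(forward(tokens))
        tokens.append(next_token)
        yield next_token              # hand the token to the caller right now
        if next_token == eos:
            return

calls = []  # records every forward pass, to show when work actually happens

def toy_forward(tokens):
    calls.append(len(tokens))
    preferred = max(tokens[-1] - 1, 0)      # count down toward 0
    return [1.0 if i == preferred else 0.0 for i in range(4)]

def greedy(logits):
    return max(range(len(logits)), key=lambda i: logits[i])

gen = stream_tokens([3], toy_forward, greedy, eos=0)
first = next(gen)
print(first, len(calls))   # 2 1  (one forward pass has run, no more)
rest = list(gen)
print(rest, len(calls))    # [1, 0] 3
```

After the first `next(gen)`, exactly one forward pass has run. The later tokens do not exist anywhere yet; they come into being only when the caller asks for them.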

What “the model generated this” actually means

After reading Layer 6, the phrase “the model generated this” stops being a single act and starts being a recursion. The model produces a distribution. Your loop turns that distribution into a token. That token modifies the state. The next call produces a new distribution. Repeat until a stop condition.

The model’s role is the same in every iteration — produce a probability distribution. The sampler’s role is to make a choice. The style of the text — its randomness, its conservatism, its structural validity — is decided not by the model but by the loop around it.

That is the lesson of Layer 6, and the reason the inference layer is on top of the stack rather than buried inside Layer 5: “generation” is something you do to a model, not something a model does.

— Read alongside: Layer 6 — Inference.
