Tensors as the lingua franca
Tensors aren't a 'data structure.' They're a language. Every layer of zigllm — from raw memory to a sampled token — speaks it, and the shape of each tensor tells you what stage of the pipeline you're in.
If you read enough machine learning code, you eventually stop seeing tensors as data structures and start seeing them as a language. They are the lingua franca of every layer in the stack. The embedding layer speaks tensor. The attention block speaks tensor. The sampler reads a tensor and emits a token. The cache holds tensors. The GGUF file on disk is a serialized list of tensors with metadata.
zigllm leans into this. The whole project is organized so that whenever you cross a boundary between two layers, the only thing crossing the boundary is a tensor with a documented shape. This is not an accidental design decision — it is the point of the foundation layer.
A tensor is shape, stride, and dtype
In src/foundation/tensor.zig, a tensor is a flat buffer of bytes plus three pieces of metadata:
a shape (a slice of usize), a stride, and a dtype. The default stride is computed from the
shape (row-major), but it lives on the tensor explicitly so that views (transposes, slices,
broadcasts) can change it without touching the data.
That is the entire foundation. Everything else in the project is built on this triple.
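To make the triple concrete, here is a minimal sketch of a strided view, assuming Zig 0.11 or newer. The names TensorView, DType, rowMajorStrides, and transpose2d are hypothetical and chosen for illustration; the actual definitions in src/foundation/tensor.zig may look different.

```zig
const std = @import("std");

// Hypothetical names for illustration; the real definitions live in
// src/foundation/tensor.zig and may be laid out differently.
const DType = enum { f32, f16, bf16, q4_k, q8_0 };

fn TensorView(comptime rank: usize) type {
    return struct {
        data: []const u8, // flat byte buffer; views share it
        shape: [rank]usize,
        strides: [rank]usize, // measured in elements, not bytes
        dtype: DType,
    };
}

/// Row-major strides: the last axis is contiguous, and every earlier
/// axis steps over the product of the extents after it.
fn rowMajorStrides(comptime rank: usize, shape: [rank]usize) [rank]usize {
    var strides: [rank]usize = undefined;
    var step: usize = 1;
    var axis: usize = rank;
    while (axis > 0) {
        axis -= 1;
        strides[axis] = step;
        step *= shape[axis];
    }
    return strides;
}

/// A transpose is a view: swap shape and strides, keep the same bytes.
fn transpose2d(t: TensorView(2)) TensorView(2) {
    return .{
        .data = t.data,
        .shape = .{ t.shape[1], t.shape[0] },
        .strides = .{ t.strides[1], t.strides[0] },
        .dtype = t.dtype,
    };
}

pub fn main() void {
    const bytes = [_]u8{0} ** (2 * 3 * @sizeOf(f32));
    const shape = [2]usize{ 2, 3 };
    const m = TensorView(2){
        .data = &bytes,
        .shape = shape,
        .strides = rowMajorStrides(2, shape),
        .dtype = .f32,
    };
    const t = transpose2d(m);
    std.debug.print("shape {any} strides {any}\n", .{ m.shape, m.strides });
    std.debug.print("shape {any} strides {any}\n", .{ t.shape, t.strides });
}
```

The transposed view points at the same bytes as the original. Only the metadata differs, which is the reason for keeping the stride on the tensor instead of recomputing it from the shape.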
The tutorial path is deliberate. You start with a scalar. Then a vector. Then a matrix. Then you make a 3D batch, which is where “batch, sequence length, model dimension” stops being a slogan from a paper and starts being a shape your tensor either has or does not have.
Shapes tell you where you are in the pipeline
Once you internalize the convention, you can read a function signature in zigllm and know what stage of inference it lives in just from the shapes.
[vocab_size, d_model]: token embedding table. Layer 5.
[seq_len, d_model]: token embeddings for one sequence after lookup. Just left the embedding layer.
[batch, seq_len, d_model]: a batched stream entering a transformer block. Layer 4.
[batch, heads, seq_len, head_dim]: query / key / value tensors mid-attention. Inside Layer 4.
[batch, heads, seq_len, seq_len]: the attention score matrix. The famous quadratic.
[vocab_size]: logits for the next token. Layer 6, about to sample.
This is what people mean when they say “tensors are the lingua franca.” Every component is a function from tensors to tensors with documented shapes. If you understand the shape, you understand the role.
zigllm makes this concrete by checking shapes at runtime in debug builds. If you wire layer A’s
output [batch, seq_len, d_model] into a function that expects [batch, heads, seq_len, head_dim],
the check fails at that boundary and reports the mismatch. You do not get a silent reshape and a
quietly wrong output.
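A debug-only boundary check of this kind can be sketched in a few lines, assuming Zig 0.11 or newer. The function name checkShape and the error set are hypothetical; zigllm's own validator may be structured differently and may report more detail.

```zig
const std = @import("std");
const builtin = @import("builtin");

const ShapeError = error{ShapeMismatch};

/// Hypothetical boundary check. Comparing the slices covers both rank and
/// per-axis extents, so a 3D tensor never passes for a 4D one.
fn checkShape(actual: []const usize, expected: []const usize) ShapeError!void {
    // builtin.mode is comptime-known, so non-Debug builds drop the check.
    if (builtin.mode != .Debug) return;
    if (!std.mem.eql(usize, actual, expected)) {
        std.debug.print("expected {any}, got {any}\n", .{ expected, actual });
        return error.ShapeMismatch;
    }
}

pub fn main() void {
    const block_output = [_]usize{ 2, 16, 64 }; // [batch, seq_len, d_model]
    const qkv_expected = [_]usize{ 2, 8, 16, 8 }; // [batch, heads, seq_len, head_dim]
    checkShape(&block_output, &qkv_expected) catch |err| {
        std.debug.print("rejected: {s}\n", .{@errorName(err)});
    };
}
```

Both shapes describe the same number of elements (2048), which is exactly the case where a silent reshape would "work" and produce wrong numbers. Comparing the full shape, not the element count, is what rules that out.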
The same tensor can wear different hats
There is a subtler lesson hiding in the foundation layer, and it took me a while to see it.
The same underlying buffer can be a 2D matrix [seq_len, d_model] or a 3D batched view
[1, seq_len, d_model] or a 4D multi-head view [1, heads, seq_len, head_dim] depending on
the shape and strides you attach to it. (The head-major view needs permuted strides, not only
a new shape.) None of the data moves. Only the shape and the stride change. The buffer is
indifferent. Everything that matters lives in how you interpret it.
This is also why “split into heads” in multi-head attention is essentially free. You do not copy anything. You reshape the tensor so that the last two dimensions describe heads and head_dim instead of one big d_model. The data was always there. You just chose to name it differently.
zigllm shows this directly: the multi-head split is a reshape, not a copy. The matmul kernels then see four-dimensional inputs and apply themselves across the batch and head axes uniformly. That uniformity is what makes the transformer architecture so structurally tidy.
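The claim that the split moves no data can be checked with plain index arithmetic. The sketch below, assuming Zig 0.11 or newer, uses a hypothetical helper named offset and toy dimensions. It addresses one buffer two ways: as [seq_len, d_model] with row-major strides, and as [heads, seq_len, head_dim] with strides [head_dim, d_model, 1].

```zig
const std = @import("std");

/// Offset of a multi-index into the flat buffer, in elements.
fn offset(comptime rank: usize, strides: [rank]usize, index: [rank]usize) usize {
    var sum: usize = 0;
    for (strides, index) |stride, i| sum += stride * i;
    return sum;
}

pub fn main() void {
    const seq_len = 3;
    const heads = 2;
    const head_dim = 4;
    const d_model = heads * head_dim;

    // One buffer, filled so that each element stores its own flat offset.
    var buf: [seq_len * d_model]f32 = undefined;
    for (&buf, 0..) |*x, i| x.* = @floatFromInt(i);

    // [seq_len, d_model], row-major.
    const flat_strides = [2]usize{ d_model, 1 };
    // [heads, seq_len, head_dim] over the same buffer: no copy, only
    // different strides. Stepping one head moves head_dim elements.
    const head_strides = [3]usize{ head_dim, d_model, 1 };

    // Token 2, head 1, component 3, addressed both ways.
    const a = buf[offset(2, flat_strides, [2]usize{ 2, 1 * head_dim + 3 })];
    const b = buf[offset(3, head_strides, [3]usize{ 1, 2, 3 })];
    std.debug.print("flat view: {d}, head view: {d}\n", .{ a, b });
}
```

Both lookups land on the same element because column h * head_dim + d of the flat matrix is, by definition, component d of head h. The head view is not contiguous along its first axis, so a kernel consuming it has to honor the strides instead of assuming row-major layout.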
Tensors as a contract
If shapes are a language, they are also a contract.
When zigllm loads a GGUF model, the loader does not just slurp bytes off disk. It reads the header, validates that each tensor’s name matches the expected tensor in the architecture, and checks that the shape matches what the model code expects for that name. If you load a Mistral checkpoint into a LLaMA architecture, you find out at load time, not after twenty layers of silently corrupted activations.
This is what tensors as a lingua franca actually buys you, beyond elegance: type safety. Your
“types” are shapes and dtypes. Your “type checker” is a runtime validator that runs at the
boundaries between layers. Your “type errors” are clear: “expected [d_model, vocab_size],
got [vocab_size, d_model].”
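A load-time validator in this spirit can be sketched as a lookup from tensor name to expected shape, assuming Zig 0.11 or newer. Everything here is hypothetical: the Expected struct, the validateTensor function, the error names, the toy dimensions, and the tensor names in the table are stand-ins, not zigllm's loader or a real checkpoint.

```zig
const std = @import("std");

const LoadError = error{ UnknownTensor, ShapeMismatch };

/// Hypothetical table entry: the shape the model code expects for a name.
const Expected = struct {
    name: []const u8,
    shape: []const usize,
};

fn validateTensor(
    table: []const Expected,
    name: []const u8,
    shape: []const usize,
) LoadError!void {
    for (table) |entry| {
        if (!std.mem.eql(u8, entry.name, name)) continue;
        if (!std.mem.eql(usize, entry.shape, shape)) return error.ShapeMismatch;
        return;
    }
    return error.UnknownTensor;
}

// Toy dimensions (vocab_size = 1000, d_model = 64); names are illustrative.
const toy_arch = [_]Expected{
    .{ .name = "token_embd.weight", .shape = &[_]usize{ 1000, 64 } },
    .{ .name = "output_norm.weight", .shape = &[_]usize{64} },
};

pub fn main() void {
    // A transposed embedding table is caught before any layer runs.
    validateTensor(&toy_arch, "token_embd.weight", &[_]usize{ 64, 1000 }) catch |err| {
        std.debug.print("load failed: {s}\n", .{@errorName(err)});
    };
}
```

Separating UnknownTensor from ShapeMismatch is a design choice in this sketch: the first usually means the wrong architecture, the second usually means the right architecture with the wrong hyperparameters or layout.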
There is a real correspondence with the way the rest of the Zig ecosystem thinks about types. Comptime knows the dtype. Runtime knows the shape. Together, they cover most of what could go wrong.
Dtypes are the other half of the language
Shapes are half of the vocabulary. The other half is dtype.
A tensor is f32 or f16 or bf16 or q4_k or q8_0. The dtype determines how many bytes each
element takes, how many elements fit in one SIMD vector, and, for quantized formats, what the
block structure looks like. Two tensors with the same shape but different dtypes occupy radically
different amounts of memory, and the kernels that operate on them are radically different.
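To put numbers on "radically different," the sketch below computes storage sizes for one 4096 by 4096 weight matrix, assuming Zig 0.11 or newer. The block sizes are those of the GGML formats the dtype names refer to (Q8_0: 32 elements in 34 bytes; Q4_K: 256 elements in 144 bytes). That zigllm stores blocks with exactly these layouts is an assumption, and the DType enum and byteSize helper are hypothetical.

```zig
const std = @import("std");

const DType = enum {
    f32,
    f16,
    bf16,
    q8_0,
    q4_k,

    /// Elements per storage block (1 for plain floating-point types).
    fn blockElems(self: DType) usize {
        return switch (self) {
            .f32, .f16, .bf16 => 1,
            .q8_0 => 32,
            .q4_k => 256,
        };
    }

    /// Bytes per storage block, following the GGML block layouts.
    fn blockBytes(self: DType) usize {
        return switch (self) {
            .f32 => 4,
            .f16, .bf16 => 2,
            .q8_0 => 34, // one f16 scale + 32 int8 values
            .q4_k => 144, // two f16 scales + 12 bytes of sub-scales + 128 bytes of 4-bit values
        };
    }
};

/// Storage for n_elems elements; quantized dtypes need whole blocks.
fn byteSize(dtype: DType, n_elems: usize) usize {
    std.debug.assert(n_elems % dtype.blockElems() == 0);
    return n_elems / dtype.blockElems() * dtype.blockBytes();
}

pub fn main() void {
    const n = 4096 * 4096; // one square weight matrix
    const dtypes = [_]DType{ .f32, .f16, .q8_0, .q4_k };
    for (dtypes) |dt| {
        std.debug.print("{s}: {d} bytes\n", .{ @tagName(dt), byteSize(dt, n) });
    }
}
```

Under these layouts the same shape costs 64 MiB in f32 and 9 MiB in q4_k, roughly a sevenfold difference. A quantized tensor also cannot be indexed one element at a time, because the smallest addressable unit is a block.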
zigllm handles this with Zig’s comptime. The tensor type is generic over its element type.
The matmul kernel is generic over the element type of its inputs. When you call matmul with a
Tensor(f32) and a Tensor(f32), the compiler specializes a kernel for f32. When you call it with
a Tensor(f32) and a Tensor(BlockQ4_K), the compiler specializes a different kernel, one that
unpacks 4-bit values from K-blocks on the fly. No runtime dispatch. No vtables.
Just a different function, generated at compile time.
This is one of the places Zig as a language pays off concretely. The “language” of tensors is not just a runtime contract. It is a comptime contract. The compiler knows the dtype. The compiler picks the kernel. You never write the wrong kernel against the wrong dtype because the combination simply does not type-check.
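The mechanism can be shown with a dot product in place of a full matmul. This is a sketch assuming Zig 0.11 or newer. The ToyBlockQ8 type is a deliberately simplified stand-in (one f32 scale and four int8 values), not the BlockQ4_K layout, and the dot function is hypothetical. It only illustrates how a comptime type parameter selects the kernel body.

```zig
const std = @import("std");

// Simplified stand-in for a quantized block: one scale and four int8
// values. This is not zigllm's BlockQ4_K layout.
const ToyBlockQ8 = struct {
    scale: f32,
    q: [4]i8,
};

/// Dot product of f32 activations with weights of element type W.
/// The switch is resolved at compile time, so each instantiation
/// contains exactly one kernel body and no runtime branch on dtype.
fn dot(comptime W: type, x: []const f32, w: []const W) f32 {
    var acc: f32 = 0;
    switch (W) {
        f32 => {
            for (x, w) |xi, wi| acc += xi * wi;
        },
        ToyBlockQ8 => {
            for (w, 0..) |block, b| {
                for (block.q, 0..) |q, j| {
                    // Dequantize on the fly: weight = scale * quantized value.
                    const wv = block.scale * @as(f32, @floatFromInt(q));
                    acc += x[b * block.q.len + j] * wv;
                }
            }
        },
        else => @compileError("no dot kernel for " ++ @typeName(W)),
    }
    return acc;
}

pub fn main() void {
    const x = [_]f32{ 1, 2, 3, 4 };
    const w_f32 = [_]f32{ 1, 1, 1, 1 };
    const w_q = [_]ToyBlockQ8{.{ .scale = 0.5, .q = .{ 2, 2, 2, 2 } }};
    std.debug.print("{d} {d}\n", .{ dot(f32, &x, &w_f32), dot(ToyBlockQ8, &x, &w_q) });
}
```

Calling dot with an element type that has no prong, such as u8, stops compilation at the @compileError line. That is the sense in which an unsupported dtype combination does not type-check.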
What this gets you as a reader
If you treat the tensor as the lingua franca, you can read any of the eighteen architecture implementations in zigllm with the same eyes. You stop looking at LLaMA versus Mistral versus GPT-2 as separate codebases. You start looking at them as different grammars of the same language.
LLaMA uses RMSNorm where GPT-2 uses LayerNorm — but the input shape and output shape are the
same. LLaMA uses RoPE where GPT-2 uses learned positional embeddings — but the resulting tensor
flowing into the attention block is the same shape. Mistral adds sliding-window attention — but
the attention output is still [batch, heads, seq_len, head_dim].
The grammar is local. The vocabulary is universal. The tensor flows through.
That is, in the end, why Layer 1 is Layer 1. The foundation is not the most exciting part of the project. It is the part that makes the other five layers comprehensible.
— Read alongside: Layer 1 — Foundation.