The Caddie Lexicon
Terms, tricks, and clubhouse jargon for following a parameter golf tournament without getting stranded in the rough.
Start Here
The six terms that unlock most of the site
If you only want the short version: lower BPB is better, the artifact has to fit under 16MB, training gets just 10 minutes, and most of the game is trading model capacity against compression and speed.
Metrics & Scoring
BPB (Bits Per Byte)
The challenge's primary scoring metric.
How many bits the model needs, on average, to predict each byte of unseen text from the FineWeb validation set. Because different tokenizers split text into different-sized pieces, BPB normalizes by raw bytes rather than tokens — making scores comparable across vocabularies. Lower is better. The baseline scores 1.2244. For reference, estimates of English text entropy run around 1-2 bits per character, so the entropy of the text itself puts the practical floor for any model somewhere below 1.0 BPB.
Validation Loss
val_loss, cross-entropy loss
The model's prediction error on held-out text.
Measured in nats (natural log units). When the model sees a sequence of tokens, it predicts a probability distribution over what comes next. The cross-entropy loss measures how surprised the model is by the actual next token — lower means less surprised, which means better predictions. Related to BPB by: BPB = (val_loss / ln(2)) × (tokens_per_byte). The "validation" part means it's measured on text the model never saw during training, so it tests generalization rather than memorization.
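The conversion formula above can be checked with illustrative numbers (the loss and tokens-per-byte values here are made up for the example, not measured results):

```python
import math

def bpb(val_loss_nats: float, tokens_per_byte: float) -> float:
    """Convert per-token cross-entropy (in nats) to bits per byte."""
    bits_per_token = val_loss_nats / math.log(2)  # nats -> bits
    return bits_per_token * tokens_per_byte       # bits/token * tokens/byte

# Hypothetical: a val_loss of 2.83 nats at 0.30 tokens per byte lands
# near the baseline's 1.22 BPB.
print(round(bpb(2.83, 0.30), 4))  # 1.2248
```

Note the tokenizer dependence: a larger vocabulary lowers tokens_per_byte, which is exactly why raw val_loss numbers are not comparable across vocabularies but BPB is.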
Training Loss
train_loss
The model's prediction error on the batch it just trained on.
Same cross-entropy metric as validation loss, but measured on the training data the model just processed. Always lower than validation loss because the model has literally just seen and learned from this exact data. Useful for tracking whether the model is learning (should decrease over time) and whether the learning rate is too high (will be noisy or spike) or too low (will decrease very slowly).
Model Architecture
Transformer
The neural network architecture used for language modeling.
A transformer is a stack of identical "blocks," each containing two sub-layers: a self-attention mechanism (which lets each token look at all previous tokens to understand context) and a feed-forward network (which processes each token's representation independently). Information flows through the stack sequentially — each block refines the representation from the one below. The baseline uses 9 blocks. Invented by Vaswani et al. in 2017, transformers are the foundation of GPT, Claude, and essentially all modern language models.
Self-Attention
scaled dot-product attention, SDPA
The mechanism that lets each token "look at" previous tokens.
Each token creates three vectors: a Query ("what am I looking for?"), a Key ("what do I contain?"), and a Value ("what information do I carry?"). These are just lists of numbers, computed by multiplying the token's representation by learned weight matrices. Attention then computes a weighted sum of all previous tokens' Values, where the weights are determined by how well each Key matches the current Query (via a dot product — literally multiplying the vectors element-wise and summing). This is how the model understands context — the word "bank" attends to nearby words to determine if it means a river bank or a financial bank. The computation scales quadratically with sequence length (doubling the context requires 4× the work), which is why context length is expensive.
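A minimal NumPy sketch of the computation described above, with a causal mask so each token only attends to earlier positions (toy sizes and random inputs; real implementations batch this across heads and sequences):

```python
import numpy as np

def sdpa(Q, K, V):
    """Scaled dot-product attention with a causal mask (illustrative sketch)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # how well each Query matches each Key
    causal = np.tril(np.ones(scores.shape))  # each token sees only itself and earlier tokens
    scores = np.where(causal == 1, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: rows sum to 1
    return weights @ V                       # weighted sum of Values

rng = np.random.default_rng(0)
T, d = 4, 8                                  # 4 tokens, 8-dim head (toy sizes)
out = sdpa(rng.normal(size=(T, d)), rng.normal(size=(T, d)), rng.normal(size=(T, d)))
print(out.shape)  # (4, 8)
```

The `scores` matrix is T×T, which is where the quadratic cost in sequence length comes from.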
Grouped Query Attention (GQA)
multi-query attention (MQA)
A parameter-saving trick that shares Key/Value heads across Query heads.
Standard multi-head attention gives each head its own Q, K, and V projections. GQA shares the K and V projections across groups of heads — for example, 8 query heads might share 4 KV heads (a 2:1 ratio). This halves the KV parameter count without halving attention quality, because the Query heads can still specialize while sharing the same contextual information. The extreme version, multi-query attention (MQA), uses just 1 KV head for all query heads. The baseline uses 8 query heads with 4 KV heads.
MLP (Multi-Layer Perceptron) / Feed-Forward Network
FFN, feed-forward layer, multi-layer perceptron
The "thinking" layer that processes each token independently.
After attention gathers context, the MLP processes each token's representation through a two-layer neural network: expand to a wider hidden dimension, apply a nonlinearity (ReLU² in our case), then project back down. This is where the model stores and retrieves factual knowledge. The expansion ratio (MLP_MULT) controls the hidden width — at 2×, a 512-dim model expands to 1024 internally. The MLP accounts for 55% of all parameters in the baseline, making it the biggest lever for parameter reallocation.
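One MLP forward pass can be sketched in a few lines using the baseline-like sizes quoted above (the weight initialization here is arbitrary, chosen only to keep the demo numerically tame):

```python
import numpy as np

def mlp(x, W_up, W_down):
    """Position-wise feed-forward: expand, ReLU^2, project back (sketch)."""
    h = x @ W_up                      # (T, dim) -> (T, dim * MLP_MULT)
    h = np.maximum(h, 0.0) ** 2       # ReLU^2: zero out negatives, square the rest
    return h @ W_down                 # project back down to (T, dim)

dim, mult = 512, 2                    # sizes from the text: 512 expands to 1024
rng = np.random.default_rng(0)
W_up = rng.normal(size=(dim, dim * mult)) * 0.02
W_down = rng.normal(size=(dim * mult, dim)) * 0.02
x = rng.normal(size=(3, dim))
print(mlp(x, W_up, W_down).shape)     # (3, 512)
# One MLP holds 512*1024 + 1024*512 = ~1.05M weights, which is why
# MLP_MULT is such a large lever on total parameter count.
```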
ReLU² (Squared ReLU)
squared ReLU, ReLU squared
The activation function used in the MLP layers.
An "activation function" is a mathematical operation applied between layers of a neural network that introduces nonlinearity — without it, stacking layers would be no better than one big matrix multiply. ReLU (Rectified Linear Unit) is the simplest: if a value is negative, set it to zero; if positive, keep it. ReLU² goes one step further and squares the surviving positive values. This produces sparser activations — most values become exactly zero, and the non-zero values are amplified, creating a strong winner-take-all effect. Other popular activation functions include GELU (Gaussian Error Linear Unit, a smooth approximation of ReLU used in GPT-2/3) and SwiGLU (a gated variant used in LLaMA and Gemma). ReLU² works as well as these more complex alternatives at this model scale while using fewer parameters per layer. It was popularized by the modded-nanogpt speedrun community.
RoPE
Rotary Position Embeddings
How the model knows token order.
Transformers process all tokens in parallel, so they need explicit position information. RoPE encodes position by rotating the Query and Key vectors by an angle proportional to their position in the sequence. Tokens close together have similar rotations, so their attention scores naturally reflect proximity. Unlike learned position embeddings, RoPE can theoretically generalize to longer sequences than it was trained on (with some degradation). The base frequency (ROPE_BASE=10000) controls how quickly the rotations cycle.
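The rotation can be sketched as follows: each (even, odd) pair of dimensions is treated as 2D coordinates and rotated by an angle that grows with position (a simplified single-sequence version; in practice this is applied to Q and K inside attention):

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each (even, odd) pair of dims by a position-dependent angle."""
    T, d = x.shape
    pos = np.arange(T)[:, None]                 # token positions 0..T-1
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per dimension pair
    theta = pos * freqs                         # (T, d/2) rotation angles
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin          # standard 2D rotation
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

x = np.ones((4, 8))
print(rope(x)[0])   # position 0 gets angle 0, so it comes out unchanged
```

Because rotations preserve vector lengths, RoPE changes how Q and K align with each other without changing how "big" they are.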
Token Embedding
tok_emb, embedding table
The lookup table that converts token IDs to vectors.
A matrix of shape (vocab_size × model_dim). Each row is a learned vector representing one token. When the model sees token #42, it retrieves row 42 as the starting representation. With tied embeddings, this same table is reused at the output to convert vectors back into token probabilities. At vocab=1024 and dim=512, the embedding is 524K parameters — only 3% of the model, thanks to the small vocabulary.
Skip Connections
residual connections, U-Net skip
Shortcuts that help gradients flow through deep networks.
Each transformer block adds its output to its input: output = input + block(input). Without this, training deep networks is nearly impossible — the error signal (gradient) that flows backward during learning gets weaker and weaker with each layer until it effectively vanishes ("vanishing gradient problem"). The residual shortcut gives the gradient a direct highway through the entire network. The baseline also uses U-Net-style skip connections (borrowed from image segmentation) that pass activations from the first half of the network to the second half, weighted by learned parameters. This gives the final layers direct access to early representations, as if a golfer could see both the tee and the green simultaneously.
RMSNorm
A lightweight normalization layer.
As numbers flow through a neural network, they can drift to wildly different scales — some very large, some near zero. This makes training unstable. Normalization layers fix this by rescaling values to a consistent range. RMSNorm does this by dividing each vector by its root-mean-square (square each element, take the average, then take the square root). It's simpler and faster than the older LayerNorm (which also subtracts the mean and adds learnable scale/shift parameters). Applied before attention and before the MLP in each block. The baseline uses RMSNorm without learnable parameters — just the normalization itself.
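The whole operation fits in a couple of lines (the small `eps` constant is an assumed detail to avoid division by zero on an all-zero vector):

```python
import numpy as np

def rmsnorm(x, eps=1e-6):
    """Divide each vector by its root-mean-square; no learnable parameters."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms

v = np.array([[3.0, -4.0]])
print(rmsnorm(v))  # rms = sqrt((9+16)/2) ~= 3.536, output ~= [[0.849, -1.131]]
```

After normalization the mean of the squared entries is 1, regardless of the input's scale, which is exactly the stabilizing property described above.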
Depth Recurrence
looped transformer, weight sharing
Reusing the same transformer blocks multiple times.
Instead of 9 unique blocks, use 3 unique blocks and loop through them 10 times for 30 effective layers at the parameter cost of 3. This massively increases depth-per-parameter. However, raw weight sharing underperforms — PR #31 got worse-than-baseline results (1.2663 BPB). The Relaxed Recursive Transformers paper shows you need small per-loop LoRA adapters to let each iteration specialize slightly. Think of it as the same golf hole played 10 times, but you adjust your stance each time.
BPE (Byte Pair Encoding)
SentencePiece BPE, subword tokenization
The algorithm used to build our vocabulary.
A tokenizer splits raw text into "tokens" — the atomic units the model sees. BPE (Byte Pair Encoding) builds a vocabulary by starting with individual characters, then repeatedly merging the most common adjacent pair into a new token. After enough merges, common words like "the" become single tokens while rare words get split into pieces (e.g., "tokenization" might become "token" + "ization"). The vocabulary size controls how many merges to perform: SP1024 uses 1,024 tokens (very aggressive splitting, many tokens per word), while SP4096 uses 4,096 (fewer, longer tokens). Larger vocabularies compress text better (fewer tokens to represent the same text) but cost more embedding parameters. SentencePiece is the specific BPE implementation used here — it operates on raw text without pre-tokenization, making it language-agnostic.
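A toy version of the merge loop on a single string (real BPE counts pairs over a large corpus and records the merge table for later tokenization, but the greedy idea is the same):

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Greedy BPE on one string: repeatedly merge the most frequent adjacent pair."""
    tokens = list(text)                       # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        out, i = [], 0
        while i < len(tokens):                # replace every occurrence of the pair
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens, merges

tokens, merges = bpe_merges("the theme of the thesis", 4)
print(merges)   # the frequent pair ('t','h') -> 'th' is merged first, then 'th'+'e'
```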
LoRA (Low-Rank Adaptation)
A lightweight way to specialize shared weights.
Instead of storing a full weight matrix for each variation, LoRA stores a small "adapter" — two skinny matrices that multiply together to form a low-rank correction. For example, instead of a 512×512 matrix (262K params) per loop iteration, a rank-4 LoRA uses a 512×4 and 4×512 matrix (4K params total) — a 64× reduction. The full weight is: shared_base + adapter_A × adapter_B. This is critical for depth recurrence: the shared blocks provide the base capability, while tiny per-loop LoRA adapters let each iteration specialize. Without LoRA, pure weight sharing underperforms; with it, recurrence becomes competitive.
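The parameter arithmetic can be checked directly (sizes from the example above; initializing B to zero is a common convention, assumed here, so that training starts from the unmodified shared base):

```python
import numpy as np

dim, rank = 512, 4
rng = np.random.default_rng(0)
base = rng.normal(size=(dim, dim)) * 0.02     # shared full-rank weight
A = rng.normal(size=(dim, rank)) * 0.02       # skinny adapter matrices;
B = np.zeros((rank, dim))                     # B starts at zero so the adapter
                                              # initially contributes nothing
W_effective = base + A @ B                    # full weight for one loop iteration

full_params = dim * dim                       # 262,144
lora_params = dim * rank + rank * dim         # 4,096
print(full_params // lora_params)             # 64: per-variation cost shrinks 64x
```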
Training & Optimization
Muon
The optimizer used for matrix-shaped parameters.
An "optimizer" is the algorithm that decides how to adjust model weights after computing the error gradient. Adam, the industry standard, maintains running averages of gradients and their squares to adapt step sizes per-parameter. Muon takes a different approach for matrix-shaped parameters (the big weight matrices that dominate the model). It: (1) computes the gradient, (2) applies Nesterov momentum — a "look-ahead" trick that anticipates where the gradient is heading, (3) orthogonalizes the update using Newton-Schulz iteration, which ensures the update is balanced across all dimensions rather than favoring a few. This balanced update is why Muon converges 10-15% faster than Adam for transformer weight matrices. Created by Keller Jordan for the NanoGPT speedrun community. Small vector/scalar parameters still use Adam since orthogonalization only makes sense for matrices.
Newton-Schulz Iteration
The math that makes Muon work.
Imagine you want to push a ball downhill (toward lower loss), but the hill is steeper in some directions than others. A naive push sends you careening down the steepest slope, which may not be the most useful direction. Newton-Schulz iteration "normalizes" the push so it's equally strong in every direction — mathematically, it finds the nearest orthogonal matrix to the gradient update. An orthogonal matrix preserves lengths and angles, so no single weight dimension gets an outsized update. Muon uses a fast 5-step polynomial approximation (with specific coefficients 3.4445, -4.7750, 2.0315) that needs only a few matrix multiplications per step. The result: smoother, more balanced training.
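A sketch of the 5-step iteration using the coefficients quoted above (the toy "gradient" is constructed with a deliberately lopsided spectrum, one direction 30× stronger than another, so the balancing effect is visible):

```python
import numpy as np

def newton_schulz5(G, steps=5):
    """Quintic Newton-Schulz: push G toward the nearest orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315         # coefficients from the text
    X = G / (np.linalg.norm(G) + 1e-7)        # normalize so singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X   # odd polynomial acting on singular values
    return X

rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(4, 4)))
V, _ = np.linalg.qr(rng.normal(size=(4, 4)))
G = U @ np.diag([3.0, 1.0, 0.5, 0.1]) @ V.T   # lopsided spectrum: 3.0 vs 0.1
sv = np.linalg.svd(newton_schulz5(G), compute_uv=False)
print(sv)  # all singular values pulled toward 1: the update is balanced
```

The approximation is deliberately loose (singular values land near 1, not exactly at 1); that is good enough for an optimizer step and much cheaper than an exact SVD.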
Learning Rate
LR, MATRIX_LR, SCALAR_LR
How big each weight update step is.
Controls the magnitude of gradient updates — how big a step the model takes each time it adjusts its weights. Too high: the model overshoots and oscillates (like swinging too hard and sending the ball past the green). Too low: it converges slowly and may get stuck in a mediocre solution (leaving the ball short). The baseline uses different learning rates for different parameter types: MATRIX_LR=0.04 for Muon-optimized weight matrices, SCALAR_LR=0.04 for small scalar parameters via Adam, and TIED_EMBED_LR=0.05 for the tied embedding table. The optimal learning rate depends heavily on batch size, model scale, and available training steps.
Warmdown
learning rate decay, cooldown
Gradually reducing the learning rate at the end of training.
During the last WARMDOWN_ITERS steps of training, the learning rate linearly decays toward zero. This helps the model settle into a sharp minimum rather than bouncing around one. With the wall-clock time limit, warmdown is adaptive — the run estimates how much training time remains and starts the decay once it projects that only the warmdown period is left. Extending WARMDOWN_ITERS from 1200 to 3600 was one of PR #42's improvements.
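A minimal version of the linear decay schedule (the total step count of 5000 is hypothetical; only WARMDOWN_ITERS=3600 and MATRIX_LR=0.04 come from the text):

```python
def lr_at_step(step, total_iters, warmdown_iters, base_lr):
    """Constant LR, then linear decay to zero over the final warmdown_iters steps."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    frac_left = (total_iters - step) / warmdown_iters
    return base_lr * frac_left

print(lr_at_step(0, 5000, 3600, 0.04))     # 0.04: full LR before warmdown
print(lr_at_step(3200, 5000, 3600, 0.04))  # halfway through warmdown: 0.02
print(lr_at_step(5000, 5000, 3600, 0.04))  # 0.0 at the final step
```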
Batch Size
TRAIN_BATCH_TOKENS
How many tokens are processed per training step.
Each gradient update processes a batch of tokens. Larger batches give more accurate gradient estimates (less noise) but take longer per step. Smaller batches are noisier but allow more steps per wall-clock minute. The baseline uses 524,288 tokens per step. Our Hole 5 discovery: quartering to 131,072 gave 4× more steps in the same time, beating the full-batch run despite noisier gradients. The optimal batch size depends on GPU throughput and available wall time.
Gradient Accumulation
Simulating a larger batch by accumulating gradients over multiple smaller forward passes.
If a full batch doesn't fit in GPU memory, you can process it in chunks ("micro-batches"), accumulate the gradients, then do one combined weight update. The baseline uses grad_accum_steps = 8 // world_size — so on a single GPU, it processes 8 micro-batches per step. This gives the same mathematical result as processing the full batch at once, just slower. On 8 GPUs, each GPU handles one micro-batch in parallel.
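The "same mathematical result" claim is easy to verify on a toy linear model: averaging the micro-batch gradients reproduces the full-batch gradient exactly (model and batch sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))           # a "full batch" of 8 examples
y = rng.normal(size=8)
w = rng.normal(size=4)

def grad(Xb, yb, w):
    """Mean-squared-error gradient for a linear model on one micro-batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                  # one pass over the whole batch

accum = np.zeros(4)                   # same batch as 4 micro-batches of 2
for i in range(0, 8, 2):
    accum += grad(X[i:i+2], y[i:i+2], w) / 4   # average of micro-batch gradients

print(np.allclose(full, accum))       # True: mathematically the same update
```

The equivalence holds because the loss is a mean over examples; only the order of summation changes.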
Compression & Quantization
Quantization Gap
The quality lost when compressing model weights.
After training, model weights are high-precision floating-point numbers (32 bits each). For the competition artifact, they get rounded to 8-bit integers (INT8) and compressed. This rounding introduces small errors. The quantization gap is the difference in val_bpb between the original model and the quantized-then-decompressed version. The baseline gap is about 0.007 BPB. Keeping the embedding table in fp16 instead of INT8 reduces this to ~0.0005 BPB.
INT8 Quantization
Rounding model weights from 32-bit floats to 8-bit integers.
Each weight is scaled and rounded to an integer in [-127, 127]. A per-row scale factor (stored as fp16) allows reconstruction: original ≈ int8_value × scale. This gives ~4× compression vs float32. The baseline uses per-row INT8 for 2D weight matrices and per-tensor INT8 for vectors. After INT8, the blob is further compressed with zlib. Small tensors (<65K elements) skip INT8 and are stored directly in fp16.
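A sketch of the per-row scheme described above (tensor sizes and the error bound are illustrative; real exporters also handle all-zero rows and the fp16 fallback for small tensors):

```python
import numpy as np

def quantize_per_row(W):
    """Per-row symmetric INT8: each row gets its own fp16 scale."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q, scale):
    """Reconstruct: original ~= int8_value * scale."""
    return q.astype(np.float32) * scale.astype(np.float32)

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64)).astype(np.float32) * 0.02
q, scale = quantize_per_row(W)
err = np.abs(dequantize(q, scale) - W).max()
print(err < 1e-3)   # True: error is bounded by about half a quantization step
```

Per-row scaling matters because one unusually large weight in a row only degrades that row's precision, not the whole tensor's.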
zlib Compression
Lossless compression applied after quantization.
Standard deflate compression applied at level 9 (max compression). Works well on INT8 weights because the reduced value range (256 possible values vs billions for float32) creates more repetition for the compressor to exploit. The combined INT8+zlib pipeline compresses the baseline's 67MB raw model to 15.8MB.
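The effect is easy to demonstrate with the standard-library zlib module (the weight values and the quantization scale here are made up; only the level-9 setting comes from the text):

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=100_000).astype(np.float32) * 0.02

raw_f32 = weights.tobytes()                              # 400,000 bytes
q = np.clip(np.round(weights / 0.001), -127, 127).astype(np.int8)
raw_int8 = q.tobytes()                                   # 100,000 bytes

z_f32 = zlib.compress(raw_f32, level=9)
z_int8 = zlib.compress(raw_int8, level=9)
print(len(z_int8) < len(z_f32))   # True: fewer distinct values -> more redundancy
```

Float32 mantissas look essentially random to deflate, so the float blob barely compresses; the INT8 blob starts 4× smaller and shrinks further.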
QAT (Quantization-Aware Training)
Training with simulated quantization so the model learns to tolerate rounding.
Normally, quantization happens after training is complete — the model is trained in high precision, then the weights are rounded down for compression. QAT does something clever: during training itself, it simulates the rounding in the forward pass (so the model "experiences" quantized weights) but keeps full-precision weights for the backward pass (so gradients remain accurate). This is done via the Straight-Through Estimator (STE) — a trick where gradients flow through the rounding operation as if it weren't there. Over time, the model learns to arrange its weights so they round cleanly, dramatically reducing the quality lost during compression. PR #38 showed an 18× reduction in quantization degradation (0.00217 → 0.00012 BPB). QAT also tends to produce more compressible weights since the model converges to values that cluster near the quantization grid points.
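A single-weight toy illustrating the fake-quantization forward pass and the STE gradient (the scale, target, and learning rate are arbitrary; the point is that the forward pass sees the rounded weight while updates land on the full-precision copy):

```python
import numpy as np

def fake_quant(w, scale=0.05):
    """Simulate INT8-style rounding in the forward pass."""
    return np.clip(np.round(w / scale), -127, 127) * scale

w, x, target, lr = 0.83, 1.0, 0.50, 0.1
for _ in range(50):
    w_q = fake_quant(w)                   # forward pass sees the rounded weight
    grad = 2 * (w_q * x - target) * x     # STE: d(w_q)/d(w) treated as 1
    w -= lr * grad                        # update the full-precision copy
print(abs(fake_quant(w) - 0.50) < 1e-6)   # True: w settles where it rounds cleanly
```

Note that the gradient is zero once the weight rounds to the target grid point, so training naturally parks weights near values that survive quantization, which is the mechanism behind the reduced quantization gap.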
Ternary Weights
BitNet b1.58, 1.58-bit
Extreme quantization where every weight is -1, 0, or +1.
At 1.58 bits per parameter, you can pack ~80M parameters into 16MB — compared to ~17M at INT8. Each weight is one of three values, multiplied by a learned scale factor per row. The trade-off: at small scale (under 3 billion params), ternary models have roughly half the effective capacity of full-precision models. So 80M ternary ≈ 40M "effective" parameters — still a 2.3× gain over the baseline's 17M. Requires careful training: a staged approach (train full-precision first, then convert to ternary), a lower learning rate than usual, and LayerNorm instead of RMSNorm (LayerNorm includes learnable scale and shift parameters that help compensate for the extreme quantization).
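A sketch of absmean-style ternarization in the spirit of BitNet b1.58, using the per-row scale described above (sizes are toy, and real training would apply this with QAT rather than as a one-shot conversion):

```python
import numpy as np

def ternarize(W):
    """Absmean ternarization: each weight -> {-1, 0, +1} times a per-row scale."""
    scale = np.abs(W).mean(axis=1, keepdims=True)      # per-row scale factor
    T = np.clip(np.round(W / (scale + 1e-8)), -1, 1)   # snap to nearest of -1, 0, +1
    return T, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16)) * 0.02
T, scale = ternarize(W)
print(sorted(set(T.flatten().tolist())))   # every entry is one of -1.0, 0.0, 1.0
# Three states carry log2(3) ~= 1.58 bits of information per weight,
# which is where the "1.58-bit" name comes from.
```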
fp16 / fp32 / bf16
half precision, single precision, bfloat16
Number formats for storing model weights.
Computers represent decimal numbers using "floating point" formats. fp32 (32-bit float) uses 32 bits per number, giving ~7 decimal digits of precision. fp16 (16-bit float) uses 16 bits, giving ~3.5 digits — half the memory but less precise. bf16 (bfloat16) is a compromise: 16 bits like fp16 but with the same exponent range as fp32, making it better for neural network training where the range of values matters more than fine precision. The baseline trains in bf16 for speed, keeps optimizer state in fp32 for accuracy, and exports the final artifact in INT8 for compression. Keeping the embedding table in fp16 instead of INT8 during export is one of the key optimizations (see quantization gap).
Challenge Rules
Artifact
The compressed package that must fit in 16MB.
The competition artifact is: code bytes (your train_gpt.py script) + compressed model bytes (INT8+zlib). The total must be under 16,000,000 bytes (decimal 16MB, not 16 MiB = 16,777,216). No external downloads or network calls are allowed during evaluation — everything the model needs must be inside this artifact. The baseline uses 47KB of code + 15.8MB of compressed model = 15.86MB total.
Wall Clock Limit
The real-world time limit for training.
Training must complete in 10 minutes (600 seconds) of wall-clock time on 8×H100 SXM GPUs. This is measured from the first training step (after warmup/compilation) to the last. The model also gets a separate 10-minute budget for evaluation. The wall clock limit means you can't just train longer — you have to make each step count.
Test-Time Training
Adapting the model to each document during evaluation.
During evaluation, for each document: take the first ~500 tokens, run a few gradient steps updating a small subset of parameters (LayerNorms, gates), then evaluate on the full document with the adapted model. This lets the model learn document-specific patterns (writing style, topic vocabulary) on the fly. Restore original weights before the next document. Uses the separate 10-minute eval budget. The NanoGPT speedrun showed even ~500 tokens of adaptation meaningfully improves predictions.
Parameters
weights, model size
The learnable numbers that define what a model knows.
Every neural network is defined by its parameters — millions of numbers (called "weights") that are adjusted during training to minimize prediction error. When someone says a model has "17 million parameters," they mean 17 million individual numbers that were tuned by the optimizer. More parameters generally means more capacity to learn complex patterns, but also a larger artifact to compress. At INT8 quantization (1 byte per parameter), 17M parameters take ~17MB before compression. The 16MB artifact limit is ultimately a limit on how many parameters you can afford.