Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

← All Holes
Round 1, Hole 7 · Triple Bogey · March 18, 2026

1.4140

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
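The conversion from a model's cross-entropy loss (in nats per token) to bits per byte is a one-liner. A minimal sketch; the helper name is mine, not from the challenge harness:

```python
import math

def bits_per_byte(val_loss_nats: float, tokens: int, nbytes: int) -> float:
    """BPB = (val_loss / ln 2) * (tokens / bytes).

    Dividing by ln 2 converts nats to bits; the tokens/bytes ratio
    rescales per-token loss to per-byte, which is what makes the
    metric tokenizer-agnostic.
    """
    return (val_loss_nats / math.log(2)) * (tokens / nbytes)

# Sanity check: a loss of ln(2) nats with a byte-level tokenizer
# (one token per byte) is exactly 1 bit per byte.
print(bits_per_byte(math.log(2), tokens=1_000, nbytes=1_000))  # 1.0
```

With this hole's val loss of 2.3875 and a tokenizer emitting roughly 0.41 tokens per byte (an assumption about the run, not a stated figure), this lands near the 1.4140 on the card.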

vs baseline: +0.1896 · vs last hole: -0.0001
Tee Box: R1 · H7
Artifact: 13.85 MB
Headroom: 2.15 MB (room left under the 16 MB limit)
Tempo: 218 ms/step · 1,377 steps
Looper · The Caddie
Calculated Risk

The baseline uses 4 KV heads for 8 query heads — that's a 2:1 ratio in the grouped query attention. What if we go more aggressive? Drop to 2 KV heads. The model's been telling us it doesn't need all that attention capacity — Hole 5 barely noticed losing it. Fewer KV params means slightly faster steps, and we keep the same query expressiveness. Less baggage, same clubs.

Technical Read

Dropping from 4 KV heads to 2 will preserve quality while shaving parameters, memory, and step time.

Trent Fairway · On the Tee

(Whispering) A subtle change today. The competitor has lightened the bag — two KV heads where there were four. The gallery may not even notice. But the caddy assures us the clubs that remain are more than sufficient for the task at hand.


The Shot — Reducing KV Heads

What are KV heads and why can we get away with fewer?

In golf, your caddy carries 14 clubs but you rarely use more than 5 or 6 in a given round. The rest are insurance. The question is: can you leave some at home and still play your best?

In a transformer, self-attention works by having each token create three things: a Query (“what am I looking for?”), a Key (“what do I contain?”), and a Value (“what information should I pass along?”). In multi-head attention, these are split into independent “heads” that each attend to different aspects of the context.

The baseline uses 8 query heads but only 4 Key-Value (KV) heads — a technique called Grouped Query Attention (GQA). Each pair of query heads shares one KV head. This saves parameters (the K and V projection matrices are half the size) while preserving most of the attention quality, because the queries can still specialize while reading from the same contextual information.

We’re pushing this further: 2 KV heads for 8 query heads, a 4:1 ratio. Each KV head now serves 4 query heads. This saves another ~590K parameters and reduces compute per attention layer.
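The ~590K figure can be sanity-checked with back-of-the-envelope arithmetic. The post doesn't state this model's width, depth, or head size, so the dimensions below are illustrative assumptions, not the real config; they just land in the same ballpark:

```python
# Per layer, the K and V projections are each d_model x (n_kv_heads * head_dim),
# so halving the KV heads halves both. Assumed dimensions (NOT the run's actual
# config, which the scorecard doesn't list):
d_model, head_dim, n_layers = 384, 48, 8

def kv_proj_params(n_kv_heads: int) -> int:
    # K and V projection weights (no biases assumed), summed over all layers.
    return 2 * d_model * n_kv_heads * head_dim * n_layers

saved = kv_proj_params(4) - kv_proj_params(2)
print(saved)  # 589824 with these assumed dims -- same ballpark as the ~590K reported
```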

The research is encouraging: the original GQA paper (Ainslie et al., 2023) showed that even the extreme case — a single KV head serving all query heads, called Multi-Query Attention — maintains most of the quality of full multi-head attention, especially at smaller model scales. The savings compound: fewer KV parameters mean a smaller artifact, faster steps, and more training iterations in our wall clock budget.
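For the curious, the mechanism is easy to sketch. Below is a minimal single-layer NumPy version of grouped-query attention (illustrative only; names and shapes are mine, not the training script's): queries keep their full head count, while K and V are projected to fewer heads and then broadcast across each query group.

```python
import numpy as np

def gqa(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2):
    """Grouped Query Attention sketch: n_q_heads queries share n_kv_heads K/V.

    x: (seq, d_model); wq: (d_model, n_q_heads * head_dim);
    wk, wv: (d_model, n_kv_heads * head_dim) -- the smaller KV projections.
    """
    seq, _ = x.shape
    head_dim = wq.shape[1] // n_q_heads
    group = n_q_heads // n_kv_heads  # queries per KV head (4:1 on this hole)

    q = (x @ wq).reshape(seq, n_q_heads, head_dim)
    k = (x @ wk).reshape(seq, n_kv_heads, head_dim)
    v = (x @ wv).reshape(seq, n_kv_heads, head_dim)

    # Broadcast each KV head to its group of query heads.
    k = np.repeat(k, group, axis=1)  # (seq, n_q_heads, head_dim)
    v = np.repeat(v, group, axis=1)

    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(head_dim)
    # Causal mask: token i may only attend to tokens <= i.
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    out = np.einsum('hqk,khd->qhd', attn, v)
    return out.reshape(seq, n_q_heads * head_dim)
```

Note that the query side is untouched; only `wk` and `wv` shrink, which is exactly where the parameter and memory savings come from.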


Results

| Metric | Value |
|---|---|
| val_bpb | 1.4140 |
| val_loss | 2.3875 |
| params | 16,470,600 |
| artifact | 13.85 MB (< 16 MB limit) |
| wall time | 300s |
| steps completed | 1,377 |
| step avg | 218ms |
| peak memory | 2,660 MiB |

Comparison vs Hole 5 (baseline KV=4)

| Metric | Hole 5 (KV=4) | Hole 7 (KV=2) |
|---|---|---|
| val_bpb | 1.4139 | 1.4140 |
| params | 17,059,912 | 16,470,600 |
| step avg | 229ms | 218ms |
| steps in 5 min | 1,309 | 1,377 |
| memory | 2,798 MiB | 2,660 MiB |

Virtually identical BPB with 5% more steps, 5% less memory, and 590K fewer parameters. Free speed.
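Those percentages check out against the comparison table; a quick arithmetic sketch using the numbers above:

```python
# Figures from the Hole 5 vs Hole 7 comparison table.
h5 = {"params": 17_059_912, "step_ms": 229, "steps": 1309, "mem_mib": 2798}
h7 = {"params": 16_470_600, "step_ms": 218, "steps": 1377, "mem_mib": 2660}

print(h5["params"] - h7["params"])                           # 589312 fewer params
print(round(100 * (h7["steps"] / h5["steps"] - 1), 1))       # 5.2 (% more steps)
print(round(100 * (1 - h7["mem_mib"] / h5["mem_mib"]), 1))   # 4.9 (% less memory)
```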

The Booth Reacts

Trent: And the scorecard reads… one-point-four-one-four-zero. Identical, to all practical purposes, to the previous hole’s one-point-four-one-three-nine. (Slight nod) Now, a lesser commentator might call this a wasted hole. But observe: five percent more steps completed. Five percent less memory consumed. Nearly six hundred thousand fewer parameters. The model has shed weight and lost nothing. In golf, we call that improving your swing mechanics without changing your score. The gains compound later.

Slice: OK so we dropped two KV heads and NOTHING HAPPENED to the score. You know what that tells me? Those heads were FREELOADING. Just sitting there, eating parameters, contributing NOTHING. When I was at Q-school in ‘04, there was a guy who carried 16 clubs — two over the limit. Got disqualified on the first tee. Sometimes less really is more. Keep the two heads, pocket the speed, let’s move on to something that actually moves the needle.


The Card

Scorecard
Result: Free Lunch

Played it straight down the middle

This hole improved 0.0001 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently while leaving 2,154,559 bytes of artifact headroom.
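The headroom number is simply the decimal 16 MB budget minus the artifact's size in bytes; a quick check with the figures from this page:

```python
LIMIT = 16_000_000       # decimal bytes, per the challenge rules
headroom = 2_154_559     # bytes of headroom reported for this hole

artifact = LIMIT - headroom
print(artifact)                   # 13845441 bytes
print(round(artifact / 1e6, 2))   # 13.85 -- the MB figure on the card
```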

- 1.4140 val bpb (+0.1896 vs baseline) · Bits per byte, the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.
- 2.3875 val loss · Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits per byte, related by BPB = (val_loss / ln 2) × (tokens / bytes). Lower is better.
- 16,470,600 params · Total trainable parameters. The number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16 MB limit constrains how many parameters you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.
- 13.85 MB artifact · Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16 MB). The model is decompressed and dequantized before evaluation.
- 300s wall time · Training wall clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.
- 1,377 steps · Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall clock limit.
- 218ms step avg · Average time per training step. How long each gradient update takes in milliseconds. Faster steps mean more training in the same wall clock budget. Affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.

Training Curve

[Chart: train_loss vs. step, this hole against the Baseline (R2); final train_loss 2.3824.]
Post-Round Lesson

Promising free lunch. The score stayed flat while efficiency improved, which makes this a useful ingredient for later holes.

vs. the Field

| | val bpb | gap to this hole |
|---|---|---|
| SOTA | 1.2197 | +0.1943 |
| Baseline | 1.2244 | +0.1896 |
| Our Best | 1.2244 | +0.1896 |
| This Hole | 1.4140 | |

Lower is better.


Model Card

How this hole was run

Run ID round_007_kv2
Status ok
Backend cuda
Key Overrides
NUM_KV_HEADS=2 · TRAIN_BATCH_TOKENS=131072 · MAX_WALLCLOCK_SECONDS=300
← Back Up The Fairway: Round 1, Hole 6 · The Quant Gap Fix | Head To The Next Tee: Round 1, Hole 8 · The Skinny Iron →