Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

← All Holes
Round 1, Hole 11 · Triple Bogey · March 19, 2026

1.4079

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
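The conversion from raw loss to this score is mechanical, using the card's own formula BPB = (val_loss / ln 2) × (tokens / bytes). A minimal sketch, with hypothetical token and byte counts chosen so the ratio matches this hole's numbers:

```python
import math

def bits_per_byte(val_loss_nats: float, tokens: int, num_bytes: int) -> float:
    """Convert per-token cross-entropy (in nats) to bits per byte of raw text."""
    return (val_loss_nats / math.log(2)) * (tokens / num_bytes)

# Hypothetical counts: 41,055 tokens covering 100,000 bytes reproduces this
# hole's ratio (val_loss 2.3772 -> val_bpb ~1.408).
print(round(bits_per_byte(2.3772, 41_055, 100_000), 3))  # 1.408
```

Because the byte count, not the token count, sits in the denominator, a model with a chunkier tokenizer can't game the metric.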

vs baseline: +0.1835 · vs last hole: +0.0022
Tee Box: R1 · H11
Artifact: 14.45 MB
Headroom: 1.55 MB (room left under 16 MB)
Tempo: 221 ms · 1,359 steps
Looper · The Caddie
Calculated Risk

We’ve got two clubs the caddie likes: value embeddings from Hole 10 and KV=2 from Hole 7. One buys quality, the other buys tempo. In a perfect world, we stack them and stroll into the clubhouse with both gains at once. In the real world, architecture tweaks love to get precious about each other. Let’s find out which version of golf this is.

Technical Read

Combining value embeddings (Hole 10 win) with KV=2 (Hole 7 free lunch) should stack both gains — better quality plus faster steps.

Trent Fairway · On the Tee

(Whispering) A delicate bit of overconfidence, perhaps. The competitor is combining two earlier wins in one swing: value embeddings from Hole 10 and reduced KV heads from Hole 7. In golfing terms, it is rather like changing both your club and your grip after a birdie. The ingredients have pedigree. Whether the recipe does is another matter entirely.

The Shot — Combining Architectural Wins

Why wouldn't two improvements simply add up?

In golf, a new driver and a new putting grip might each save you a stroke independently. But they don’t necessarily save you two strokes together — the driver change might alter your approach angles, making the putting improvement less relevant on the shots you actually face.

Architectural changes in neural networks interact the same way. Value embeddings add extra information to the attention value signal. Reducing KV heads from 4 to 2 halves the dimensionality of that value signal — from 256 to 128 dimensions. So the value embeddings that worked at 256 dimensions now have to squeeze into 128 dimensions, which means less room for the extra information they carry.
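The squeeze is easy to see in a back-of-the-envelope sketch of how grouped-query attention sizes the value path. The head width of 64 is inferred from the 256 → 128 figures above, not read from the training script:

```python
# head_dim inferred from the card: 4 KV heads -> 256 dims, so 64 per head.
HEAD_DIM = 64

def value_dim(num_kv_heads: int, head_dim: int = HEAD_DIM) -> int:
    """Total width of the concatenated value vectors that attention mixes."""
    return num_kv_heads * head_dim

print(value_dim(4))  # 256: the space the Hole 10 value embeddings were tuned in
print(value_dim(2))  # 128: the squeezed space they land in on this hole
```

Halving the KV heads halves the room the value embeddings have to add their extra signal.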

On the speed side, the combination does stack: 221ms per step vs 236ms with just value embeddings (6% faster, ~85 more steps in 5 minutes). But the quality loss from the smaller value space partially erodes the value embedding gains.
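The step-count arithmetic is easy to check. A tiny sketch, assuming the 300-second budget from the card:

```python
def steps_in_budget(step_ms: float, budget_s: float = 300.0) -> int:
    """Optimizer steps that fit in the wall-clock budget."""
    return int(budget_s * 1000 / step_ms)

kv4 = steps_in_budget(236)  # value embeddings alone (Hole 10 pace)
kv2 = steps_in_budget(221)  # value embeddings + KV=2 (this hole)
print(kv4, kv2, kv2 - kv4)  # 1271 1357 86
```

Roughly 85 extra steps in the same five minutes, in line with the figure above; it just wasn't enough to pay back the quality lost in the value space.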

This is a common pattern in model optimization: improvements are rarely perfectly additive. Each change reshapes the loss landscape, and the next change operates on a different surface than the one it was tested on. The discipline is to test combinations empirically rather than assuming they’ll stack, and to keep the individual changes available for mixing differently later.
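One way to keep that discipline is a small ablation grid. The harness below is hypothetical (`train_and_eval` stands in for whatever the real script exposes); only the four BPB numbers come from this hole's stack table:

```python
from itertools import product

def best_config(train_and_eval):
    """Sweep value embeddings on/off crossed with KV head count; keep the best BPB."""
    grid = product([False, True], [4, 2])
    scores = {cfg: train_and_eval(*cfg) for cfg in grid}
    return min(scores, key=scores.get)  # lower BPB is better

card = {  # (value_embs, num_kv_heads) -> val_bpb from The Stack table
    (False, 4): 1.4139, (False, 2): 1.4140,
    (True, 4): 1.4057, (True, 2): 1.4079,
}
print(best_config(lambda ve, kv: card[(ve, kv)]))  # (True, 4)
```

Run the full grid rather than assuming additivity, and the winner falls out: value embeddings with all four KV heads.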

Results

Metric | Value
val_bpb | 1.4079
val_loss | 2.3772
params | ~17,200,000
artifact | 14.45 MB (yes, < 16MB)
wall time | 300s
steps completed | 1,359
step avg | 221ms

The Stack

Hole | Config | BPB | Step avg
5 | Baseline arch | 1.4139 | 229ms
7 | + KV=2 | 1.4140 | 218ms
10 | + Value embs (KV=4) | 1.4057 | 236ms
11 | + Value embs + KV=2 | 1.4079 | 221ms

This is the sort of result every experiment notebook needs: not a disaster, not a triumph, but a firm answer. Value embeddings at KV=4 remains the better club. KV=2 bought back some pace, but it squeezed the very mechanism that made Hole 10 special.

The Booth Reacts

Trent: One-point-four-zero-seven-nine. (Measured nod) Respectable — second-best on our card, in fact — but not the glorious stack one had hoped for. The value embeddings, it seems, prefer the wider canvas of four KV heads. Compress the value space and the flourish loses some of its force. A useful answer, though. Good clubs are not always good doubles partners.

Slice: So let me get this straight. KV=2 by itself: fine. Value embeddings by themselves: best ball on the course. Put them together and suddenly everybody forgets how to coordinate? (Throws hands up) Classic neural-network behavior. Fine. Message received. Keep the value embeddings, give them the full four KV heads, and stop trying to make every good idea marry every other good idea on the first date.

The Card

Scorecard result: Encouraging Miss

Dropped a shot versus the last hole

This hole lost 0.0022 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently while leaving 1,550,686 bytes of artifact headroom.
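The headroom figure is just the cap minus the artifact; a one-liner confirms the card's numbers agree:

```python
LIMIT_BYTES = 16_000_000      # the challenge's decimal 16 MB cap
HEADROOM_BYTES = 1_550_686    # headroom reported on this card

artifact_bytes = LIMIT_BYTES - HEADROOM_BYTES
print(artifact_bytes)                    # 14449314
print(f"{artifact_bytes / 1e6:.2f} MB")  # 14.45 MB
```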

val bpb: 1.4079 (+0.1835). Bits per byte, the headline score: how many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

val loss: 2.3772. Validation cross-entropy loss: the model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits per byte; the two are related by BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.

params: ~17,200,000. Total trainable parameters. More parameters generally means more capacity to learn, but also a larger artifact. The 16MB limit constrains how many you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

artifact: 14.45 MB. Compressed model + code size: the training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16MB). The model is decompressed and dequantized before evaluation.

wall time: 300s. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs; our L40S iteration runs use shorter time limits, since we're after directional signal rather than final scores.

steps: 1,359. Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. The number of steps you get depends on your batch size, GPU speed, and wall clock limit.

step avg: 221ms. Average time per training step. Faster steps mean more training in the same wall clock budget; affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
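The parameter-budget rules of thumb in the glossary follow from simple byte math. A sketch (quantization widths only; zlib compression, which is what lets this hole's 17.2M parameters fit, isn't modeled):

```python
LIMIT_BYTES = 16_000_000  # decimal 16 MB cap

def raw_weight_bytes(num_params: int, bits_per_param: float) -> float:
    """Uncompressed weight bytes at a given quantization width."""
    return num_params * bits_per_param / 8

# The card's rules of thumb check out:
print(raw_weight_bytes(16_000_000, 8) <= LIMIT_BYTES)     # True: ~16M params at INT8
print(raw_weight_bytes(80_000_000, 1.58) <= LIMIT_BYTES)  # True: ~80M at ternary
```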

Training Curve

[Chart: train_loss by training step, this hole vs. Baseline (R2); final train_loss 2.3668.]
Post-Round Lesson

Partial stack. KV=2 shrinks the value embedding dimension too, reducing their effectiveness. The speed gain (221ms vs 236ms per step) didn't fully compensate.

vs. the Field

+0.1882 vs SOTA (1.2197)
+0.1835 vs Baseline (1.2244)
+0.1835 vs Our Best (1.2244)
This Hole: 1.4079 (lower is better)

Model Card

How this hole was run

Run ID round_011_valemb_kv2
Status ok
Training Script train_gpt_valemb.py
Backend cuda
Key Overrides
NUM_KV_HEADS=2, TRAIN_BATCH_TOKENS=131072, MAX_WALLCLOCK_SECONDS=300
← Back Up The Fairway: Round 1, Hole 10 · The Extra Set of Eyes
Head To The Next Tee: Round 1, Hole 12 · Finishing the Swing →