Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

← All Holes
Round 1, Hole 14 · Bogey · March 19, 2026

1.3404

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.1160 · vs last hole: -0.0010
Tee Box: R1 · H14
Artifact: 16.20 MB
Headroom: 0.00 MB (room left under the 16 MB cap)
Tempo: 235 ms/step · 2,552 steps
Looper · The Caddie
Safe Tweaks

Hole 13 with 3 tables was 254KB over budget. Drop to 2. That should save another ~260KB. The quality held when we went from 5 to 3, so going to 2 should be safe. If it fits under 16MB, we have our submission config.

Technical Read

Reducing from 3 value embedding tables to 2 should save another ~260KB in artifact size, hopefully getting us under the 16MB limit.

Trent Fairway · On the Tee

(Whispering) Down to two. Two value embedding tables where once there were five. The competitor is testing the absolute floor of this technique. One notes the caddy consulting a calculator rather more than usual today.


The Shot — Minimum Viable Value Embeddings

Can two shared tables carry the same signal as five?

In golf, some clubs are nearly interchangeable. Your 6-iron and 7-iron cover overlapping distances. If you had to leave one at home, you’d barely notice. But leave too many behind and gaps open up.

We’ve been systematically testing how many value embedding tables the model actually needs. Five tables (Hole 12): 1.3394 BPB. Three tables (Hole 13): 1.3414. The quality barely moved. Now we’re trying two — one table shared across layers 0-3, another across layers 4-8.
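The layer-to-table mapping described above can be sketched in a few lines. This is a minimal NumPy illustration, not the run's actual training code: the class name, vocabulary size, and embedding width are assumptions chosen so each table lands near the ~262K parameters the writeup mentions.

```python
import numpy as np

class SharedValueEmbeddings:
    """Two value-embedding tables shared across layer groups.

    Hypothetical sketch of the Hole 14 config: layers 0-3 share
    table 0, layers 4-8 share table 1. Shapes are assumptions
    (65536 x 4 = 262,144 params per table)."""

    def __init__(self, vocab_size=65536, dim=4, n_layers=9, seed=0):
        rng = np.random.default_rng(seed)
        self.tables = [rng.standard_normal((vocab_size, dim)) * 0.02
                       for _ in range(2)]
        # Layers 0-3 -> table 0, layers 4-8 -> table 1.
        self.layer_to_table = [0 if i <= 3 else 1 for i in range(n_layers)]

    def lookup(self, token_ids, layer_idx):
        # Every layer in a group reads from the same underlying weights.
        return self.tables[self.layer_to_table[layer_idx]][token_ids]

ve = SharedValueEmbeddings()
vecs = ve.lookup(np.array([5, 9]), layer_idx=2)  # shape (2, 4), from table 0
```

The point of the sketch is that layers 0 and 3 (or 4 and 8) hit identical storage, so cutting from three tables to two removes one table's worth of parameters without touching the per-layer lookup interface.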

The underlying hypothesis: adjacent layers in a transformer learn similar levels of abstraction, so they want similar value embeddings. If that’s true, aggressive sharing costs almost nothing. The risk is that layers at very different depths (like layer 0 vs layer 3) need genuinely different value information.

At 2 tables × 262K parameters = 524K extra params, the artifact should shrink by about 260KB compared to the 3-table version. Whether that’s enough to get under 16MB is the real question.
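The back-of-the-envelope arithmetic above can be written down explicitly. The exact per-table count (262,144 = 2^18) and the one-byte-per-parameter INT8 assumption are mine; the writeup only says "~262K" and "~260KB".

```python
# Artifact math for the 3-table -> 2-table move, assuming INT8
# storage (one byte per parameter) before zlib compression.
PARAMS_PER_TABLE = 262_144           # assumption: ~262K from the writeup

tables_before, tables_after = 3, 2
saved_bytes = (tables_before - tables_after) * PARAMS_PER_TABLE
extra_params = tables_after * PARAMS_PER_TABLE

print(f"expected saving: ~{saved_bytes / 1000:.0f} KB")   # ~262 KB
print(f"params spent on tables: {extra_params:,}")        # 524,288
```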


Results

| Metric | Value |
| --- | --- |
| val_bpb | 1.3404 |
| val_loss | 2.2631 |
| params | ~17,590,000 |
| artifact | 16.20 MB (STILL over 16MB by 248KB) |
| wall time | 600s |
| steps completed | ~2,552 |

Value Embedding Table Count (full series)

| Tables | val_bpb | Artifact | Under 16MB? |
| --- | --- | --- | --- |
| 5 (Hole 12) | 1.3394 | 16.72 MB | No |
| 3 (Hole 13) | 1.3414 | 16.20 MB | No |
| 2 (Hole 14) | 1.3404 | 16.20 MB | No |
| 0 (baseline) | ~1.36* | 13.35 MB | Yes |

Quality is identical across 2-5 tables. But the artifact barely budged from 3→2. The value embedding tables themselves aren’t the size bottleneck — it’s the base model’s INT8 weights. We need better compression, not fewer tables.
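You can see why the tables were never the bottleneck with a quick size probe: quantize a weight blob to INT8, zlib it, and compare the base model's share against the tables' share. This uses synthetic Gaussian weights, not the real checkpoint, so the absolute sizes are only indicative; the parameter counts come from this hole's card.

```python
import zlib
import numpy as np

def compressed_size(n_params, seed=0):
    """Approximate artifact bytes for n_params weights after INT8
    quantization + zlib, using synthetic Gaussian weights."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n_params).astype(np.float32)
    scale = np.abs(w).max() / 127.0                 # symmetric INT8 scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return len(zlib.compress(q.tobytes()))

TOTAL_PARAMS = 17_590_000       # from this hole's card
TABLE_PARAMS = 524_288          # assumption: 2 tables x 262,144

base = compressed_size(TOTAL_PARAMS - TABLE_PARAMS)   # everything else
tables = compressed_size(TABLE_PARAMS)                # the 2 value tables
print(f"base ~{base / 1e6:.1f} MB, tables ~{tables / 1e6:.2f} MB")
```

Even before compression, the two tables are at most ~0.5 MB of a ~17.6 MB INT8 blob, so trimming them can never close a multi-hundred-KB gap on its own.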

The Booth Reacts

Trent: (Long pause) Sixteen-point-two-zero megabytes. Again. (Removes glasses, polishes them) The mathematics are rather stubborn today. Two tables or three, the artifact refuses to slip beneath sixteen million bytes. The value embeddings are not the problem. The problem is that seventeen million parameters at eight bits each, even with zlib, simply do not fit alongside their compression overhead. One suspects a different club is needed altogether. Not fewer embeddings — fewer bits per weight.

Slice: Three holes in a row we’ve been trimming value embedding tables and the artifact WILL NOT BUDGE. You know what the definition of insanity is? It’s doing the same thing and expecting — actually, you know what, I just heard something from the leaderboard. (Leans in) People are using INT6 quantization. Six bits instead of eight. That saves FOUR MEGABYTES. And sliding window eval is worth 0.03 BPB for FREE. We’ve been optimizing the wrong thing! Get the caddy back here. We need a COMPLETELY different approach.
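Slice's INT6 tip is just bit-packing: four 6-bit values fit in three bytes, so 17.59M weights drop from ~17.6 MB to ~13.2 MB, roughly the "four megabytes" he shouts about. A minimal NumPy pack/unpack sketch (the function names and value range are mine, not from the leaderboard):

```python
import numpy as np

def pack_int6(q):
    """Pack int values in [-32, 31] into 6 bits each (4 values -> 3 bytes)."""
    u = (np.asarray(q, dtype=np.int16) + 32).astype(np.uint32)  # -> [0, 63]
    assert u.size % 4 == 0, "pad to a multiple of 4 first"
    u = u.reshape(-1, 4)
    word = u[:, 0] | (u[:, 1] << 6) | (u[:, 2] << 12) | (u[:, 3] << 18)
    out = np.empty((word.size, 3), dtype=np.uint8)
    out[:, 0] = word & 0xFF                 # low byte of the 24-bit word
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = (word >> 16) & 0xFF
    return out.ravel()

def unpack_int6(b):
    """Inverse of pack_int6: 3 bytes -> 4 signed 6-bit values."""
    b = np.asarray(b, dtype=np.uint32).reshape(-1, 3)
    word = b[:, 0] | (b[:, 1] << 8) | (b[:, 2] << 16)
    u = np.stack([(word >> s) & 0x3F for s in (0, 6, 12, 18)], axis=1)
    return u.ravel().astype(np.int16) - 32

q = np.array([-32, 0, 31, -1, 5, -17, 20, 12])
assert pack_int6(q).size == 6                     # 8 values -> 6 bytes
assert (unpack_int6(pack_int6(q)) == q).all()     # lossless round trip
```

The trade is quantization error, not storage cleverness: INT6 leaves only 64 levels per weight, so whether the 0.03-BPB-class quality hit is acceptable has to be measured, not assumed.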


The Card

Scorecard
Result Dead End

Picked up strokes on the field

This hole improved 0.0010 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently. The artifact, however, is still 248KB over the 16MB limit, so there is no headroom at all.

- **val bpb: 1.3404 (+0.1160 vs baseline)** · Bits per byte, the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.
- **val loss: 2.2631** · Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits per byte, related by BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.
- **params: 17,590,000** · Total trainable parameters. More parameters generally means more capacity to learn, but also a larger artifact. The 16MB limit constrains how many parameters you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.
- **artifact: 16.20 MB** · Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16MB). The model is decompressed and dequantized before evaluation.
- **wall time: 600s** · Training wall clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're after directional signal, not final scores.
- **steps: 2,552** · Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. The step count you get depends on batch size, GPU speed, and the wall clock limit.
- **step avg: 235ms** · Average time per training step. Faster steps mean more training in the same wall clock budget. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
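The card's BPB conversion is a one-liner. The tokens-per-byte ratio below is not reported anywhere on the card; it is inferred from this hole's own numbers (1.3404 BPB at val_loss 2.2631), so treat it as an estimate.

```python
import math

def bits_per_byte(val_loss_nats, tokens_per_byte):
    """BPB = (val_loss / ln 2) * (tokens / bytes), per the card."""
    return (val_loss_nats / math.log(2)) * tokens_per_byte

# Inferred ratio for this run: ~0.41 tokens per byte (~2.44 bytes/token).
ratio = 1.3404 * math.log(2) / 2.2631
print(round(ratio, 4), round(bits_per_byte(2.2631, ratio), 4))
```

Because the ratio is fixed by the tokenizer and eval corpus, any BPB gain has to come from lowering val_loss itself.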
Headroom gauge: 0 MB left. The 16.20 MB artifact is OVER the 16 MB limit.

Training Curve

[Training curve: train_loss vs. step, this hole against the baseline (R2). Final train_loss: 2.3855.]
Post-Round Lesson

2 tables perform the same as 3 or 5: the value embeddings are highly shareable. But we're still 248KB over. The real fix isn't fewer tables; it's better compression (INT6) or sliding window eval.
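For the sliding-window-eval idea, here is a minimal framework-free sketch of the standard strided trick: instead of scoring disjoint blocks, where the first tokens of each block get almost no context, slide a full window forward and only score the new tail of each step. `per_token_loss` is a stand-in for the model's per-position loss; names and defaults are mine.

```python
def sliding_window_eval(tokens, per_token_loss, window=1024, stride=256):
    """Average per-token loss where every token is scored exactly once,
    but always with up to `window` tokens of left context."""
    total, count = 0.0, 0
    pos = 0                                   # first not-yet-scored token
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        ctx_start = max(0, end - window)      # context precedes the new tail
        losses = per_token_loss(tokens[ctx_start:end])  # one loss per position
        total += sum(losses[pos - ctx_start:])          # score only new tokens
        count += end - pos
        pos = end
    return total / count

# Dummy model: constant loss of 1.0 everywhere -> average is 1.0.
avg = sliding_window_eval(list(range(100)), lambda seq: [1.0] * len(seq),
                          window=16, stride=4)
```

The extra cost is re-running the model on overlapping context, which is why the 0.03 BPB is "free" only in artifact terms, not in eval compute.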

vs. the Field

| | val_bpb | Gap (this hole) |
| --- | --- | --- |
| SOTA | 1.2197 | +0.1207 |
| Baseline | 1.2244 | +0.1160 |
| Our Best | 1.2244 | +0.1160 |
| This Hole | 1.3404 | |

Lower is better.


Model Card

How this hole was run

Run ID round_014_valemb_2tab
Status ok
Training Script train_gpt_valemb_2tab.py
Backend cuda
Key Overrides
TRAIN_BATCH_TOKENS=131072 · MAX_WALLCLOCK_SECONDS=600
← Back Up The Fairway: Round 1, Hole 13 · Trimming the Bag | Head To The Next Tee: Round 1, Hole 15 · Reading Every Green →