Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

← All Holes
Round 1, Hole 14 · Bogey · March 19, 2026

1.3404

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.1160 · vs last hole: -0.0010
Tee Box: R1 · H14
Artifact: 16.20 MB
Headroom: 0.00 MB (room left under the 16 MB cap)
Tempo: 235 ms/step · 2,552 steps
Looper · The Caddie
Safe Tweaks

Hole 13 with 3 tables was 254KB over budget. Drop to 2. That should save another ~260KB. The quality held when we went from 5 to 3, so going to 2 should be safe. If it fits under 16MB, we have our submission config.

Technical Read

Reducing from 3 value embedding tables to 2 should save another ~260KB in artifact size, hopefully getting us under the 16MB limit.

Trent Fairway · On the Tee

(Whispering) Down to two. Two value embedding tables where once there were five. The competitor is testing the absolute floor of this technique. One notes the caddy consulting a calculator rather more than usual today.


The Shot — Minimum Viable Value Embeddings

Can two shared tables carry the same signal as five?

In golf, some clubs are nearly interchangeable. Your 6-iron and 7-iron cover overlapping distances. If you had to leave one at home, you’d barely notice. But leave too many behind and gaps open up.

We’ve been systematically testing how many value embedding tables the model actually needs. Five tables (Hole 12): 1.3394 BPB. Three tables (Hole 13): 1.3414. The quality barely moved. Now we’re trying two — one table shared across layers 0-3, another across layers 4-8.
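The layer-to-table mapping described above can be sketched in a few lines. This is a minimal NumPy illustration, not the run's actual training code: the class name, vocabulary size, and embedding width are assumptions chosen so each table lands near the ~262K parameters the writeup mentions.

```python
import numpy as np

class SharedValueEmbeddings:
    """Two value-embedding tables shared across layer groups.

    Hypothetical sketch of the Hole 14 config: layers 0-3 share
    table 0, layers 4-8 share table 1. Shapes are assumptions
    (65536 x 4 = 262,144 params per table)."""

    def __init__(self, vocab_size=65536, dim=4, n_layers=9, seed=0):
        rng = np.random.default_rng(seed)
        self.tables = [rng.standard_normal((vocab_size, dim)) * 0.02
                       for _ in range(2)]
        # Layers 0-3 -> table 0, layers 4-8 -> table 1.
        self.layer_to_table = [0 if i <= 3 else 1 for i in range(n_layers)]

    def lookup(self, token_ids, layer_idx):
        # Every layer in a group reads from the same underlying weights.
        return self.tables[self.layer_to_table[layer_idx]][token_ids]

ve = SharedValueEmbeddings()
vecs = ve.lookup(np.array([5, 9]), layer_idx=2)  # shape (2, 4), from table 0
```

The point of the sketch is that layers 0 and 3 (or 4 and 8) hit identical storage, so cutting from three tables to two removes one table's worth of parameters without touching the per-layer lookup interface.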

The underlying hypothesis: adjacent layers in a transformer learn similar levels of abstraction, so they want similar value embeddings. If that’s true, aggressive sharing costs almost nothing. The risk is that layers at very different depths (like layer 0 vs layer 3) need genuinely different value information.

At 2 tables × 262K parameters = 524K extra params, the artifact should shrink by about 260KB compared to the 3-table version. Whether that’s enough to get under 16MB is the real question.
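The back-of-the-envelope arithmetic above can be written down explicitly. The exact per-table count (262,144 = 2^18) and the one-byte-per-parameter INT8 assumption are mine; the writeup only says "~262K" and "~260KB".

```python
# Artifact math for the 3-table -> 2-table move, assuming INT8
# storage (one byte per parameter) before zlib compression.
PARAMS_PER_TABLE = 262_144           # assumption: ~262K from the writeup

tables_before, tables_after = 3, 2
saved_bytes = (tables_before - tables_after) * PARAMS_PER_TABLE
extra_params = tables_after * PARAMS_PER_TABLE

print(f"expected saving: ~{saved_bytes / 1000:.0f} KB")   # ~262 KB
print(f"params spent on tables: {extra_params:,}")        # 524,288
```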


Results

| Metric | Value |
| --- | --- |
| val_bpb | 1.3404 |
| val_loss | 2.2631 |
| params | ~17,590,000 |
| artifact | 16.20 MB (STILL over 16MB by 248KB) |
| wall time | 600s |
| steps completed | ~2,552 |

Value Embedding Table Count (full series)

| Tables | val_bpb | Artifact | Under 16MB? |
| --- | --- | --- | --- |
| 5 (Hole 12) | 1.3394 | 16.72 MB | No |
| 3 (Hole 13) | 1.3414 | 16.20 MB | No |
| 2 (Hole 14) | 1.3404 | 16.20 MB | No |
| 0 (baseline) | ~1.36* | 13.35 MB | Yes |

Quality is identical across 2-5 tables. But the artifact barely budged from 3→2. The value embedding tables themselves aren’t the size bottleneck — it’s the base model’s INT8 weights. We need better compression, not fewer tables.
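You can see why the tables were never the bottleneck with a quick size probe: quantize a weight blob to INT8, zlib it, and compare the base model's share against the tables' share. This uses synthetic Gaussian weights, not the real checkpoint, so the absolute sizes are only indicative; the parameter counts come from this hole's card.

```python
import zlib
import numpy as np

def compressed_size(n_params, seed=0):
    """Approximate artifact bytes for n_params weights after INT8
    quantization + zlib, using synthetic Gaussian weights."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n_params).astype(np.float32)
    scale = np.abs(w).max() / 127.0                 # symmetric INT8 scale
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return len(zlib.compress(q.tobytes()))

TOTAL_PARAMS = 17_590_000       # from this hole's card
TABLE_PARAMS = 524_288          # assumption: 2 tables x 262,144

base = compressed_size(TOTAL_PARAMS - TABLE_PARAMS)   # everything else
tables = compressed_size(TABLE_PARAMS)                # the 2 value tables
print(f"base ~{base / 1e6:.1f} MB, tables ~{tables / 1e6:.2f} MB")
```

Even before compression, the two tables are at most ~0.5 MB of a ~17.6 MB INT8 blob, so trimming them can never close a multi-hundred-KB gap on its own.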

The Booth Reacts

Trent: (Long pause) Sixteen-point-two-zero megabytes. Again. (Removes glasses, polishes them) The mathematics are rather stubborn today. Two tables or three, the artifact refuses to slip beneath sixteen million bytes. The value embeddings are not the problem. The problem is that seventeen million parameters at eight bits each, even with zlib, simply do not fit alongside their compression overhead. One suspects a different club is needed altogether. Not fewer embeddings — fewer bits per weight.

Slice: Three holes in a row we’ve been trimming value embedding tables and the artifact WILL NOT BUDGE. You know what the definition of insanity is? It’s doing the same thing and expecting — actually, you know what, I just heard something from the leaderboard. (Leans in) People are using INT6 quantization. Six bits instead of eight. That saves FOUR MEGABYTES. And sliding window eval is worth 0.03 BPB for FREE. We’ve been optimizing the wrong thing! Get the caddy back here. We need a COMPLETELY different approach.
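Slice's INT6 tip is just bit-packing: four 6-bit values fit in three bytes, so 17.59M weights drop from ~17.6 MB to ~13.2 MB, roughly the "four megabytes" he shouts about. A minimal NumPy pack/unpack sketch (the function names and value range are mine, not from the leaderboard):

```python
import numpy as np

def pack_int6(q):
    """Pack int values in [-32, 31] into 6 bits each (4 values -> 3 bytes)."""
    u = (np.asarray(q, dtype=np.int16) + 32).astype(np.uint32)  # -> [0, 63]
    assert u.size % 4 == 0, "pad to a multiple of 4 first"
    u = u.reshape(-1, 4)
    word = u[:, 0] | (u[:, 1] << 6) | (u[:, 2] << 12) | (u[:, 3] << 18)
    out = np.empty((word.size, 3), dtype=np.uint8)
    out[:, 0] = word & 0xFF                 # low byte of the 24-bit word
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = (word >> 16) & 0xFF
    return out.ravel()

def unpack_int6(b):
    """Inverse of pack_int6: 3 bytes -> 4 signed 6-bit values."""
    b = np.asarray(b, dtype=np.uint32).reshape(-1, 3)
    word = b[:, 0] | (b[:, 1] << 8) | (b[:, 2] << 16)
    u = np.stack([(word >> s) & 0x3F for s in (0, 6, 12, 18)], axis=1)
    return u.ravel().astype(np.int16) - 32

q = np.array([-32, 0, 31, -1, 5, -17, 20, 12])
assert pack_int6(q).size == 6                     # 8 values -> 6 bytes
assert (unpack_int6(pack_int6(q)) == q).all()     # lossless round trip
```

The trade is quantization error, not storage cleverness: INT6 leaves only 64 levels per weight, so whether the 0.03-BPB-class quality hit is acceptable has to be measured, not assumed.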


The Card

Scorecard
Result Dead End

Picked up strokes on the field

This hole improved 0.0010 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently. The artifact, however, is still 248KB over the 16MB limit, so there is no headroom at all.

- **val bpb: 1.3404 (+0.1160 vs baseline)** · Bits per byte, the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.
- **val loss: 2.2631** · Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits per byte, related by BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.
- **params: 17,590,000** · Total trainable parameters. More parameters generally means more capacity to learn, but also a larger artifact. The 16MB limit constrains how many parameters you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.
- **artifact: 16.20 MB** · Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16MB). The model is decompressed and dequantized before evaluation.
- **wall time: 600s** · Training wall clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're after directional signal, not final scores.
- **steps: 2,552** · Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. The step count you get depends on batch size, GPU speed, and the wall clock limit.
- **step avg: 235ms** · Average time per training step. Faster steps mean more training in the same wall clock budget. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
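The card's BPB conversion is a one-liner. The tokens-per-byte ratio below is not reported anywhere on the card; it is inferred from this hole's own numbers (1.3404 BPB at val_loss 2.2631), so treat it as an estimate.

```python
import math

def bits_per_byte(val_loss_nats, tokens_per_byte):
    """BPB = (val_loss / ln 2) * (tokens / bytes), per the card."""
    return (val_loss_nats / math.log(2)) * tokens_per_byte

# Inferred ratio for this run: ~0.41 tokens per byte (~2.44 bytes/token).
ratio = 1.3404 * math.log(2) / 2.2631
print(round(ratio, 4), round(bits_per_byte(2.2631, ratio), 4))
```

Because the ratio is fixed by the tokenizer and eval corpus, any BPB gain has to come from lowering val_loss itself.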
Headroom gauge: 0 MB left. The 16.20 MB artifact is OVER the 16 MB limit.

Training Curve

[Training curve: train_loss vs. step, this hole against the baseline (R2). Final train_loss: 2.3855.]
Post-Round Lesson

2 tables perform the same as 3 or 5: the value embeddings are highly shareable. But we're still 248KB over. The real fix isn't fewer tables; it's better compression (INT6) or sliding window eval.
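For the sliding-window-eval idea, here is a minimal framework-free sketch of the standard strided trick: instead of scoring disjoint blocks, where the first tokens of each block get almost no context, slide a full window forward and only score the new tail of each step. `per_token_loss` is a stand-in for the model's per-position loss; names and defaults are mine.

```python
def sliding_window_eval(tokens, per_token_loss, window=1024, stride=256):
    """Average per-token loss where every token is scored exactly once,
    but always with up to `window` tokens of left context."""
    total, count = 0.0, 0
    pos = 0                                   # first not-yet-scored token
    while pos < len(tokens):
        end = min(pos + stride, len(tokens))
        ctx_start = max(0, end - window)      # context precedes the new tail
        losses = per_token_loss(tokens[ctx_start:end])  # one loss per position
        total += sum(losses[pos - ctx_start:])          # score only new tokens
        count += end - pos
        pos = end
    return total / count

# Dummy model: constant loss of 1.0 everywhere -> average is 1.0.
avg = sliding_window_eval(list(range(100)), lambda seq: [1.0] * len(seq),
                          window=16, stride=4)
```

The extra cost is re-running the model on overlapping context, which is why the 0.03 BPB is "free" only in artifact terms, not in eval compute.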

vs. the Field

| | val_bpb | Gap (this hole) |
| --- | --- | --- |
| SOTA | 1.2197 | +0.1207 |
| Baseline | 1.2244 | +0.1160 |
| Our Best | 1.2244 | +0.1160 |
| This Hole | 1.3404 | |

Lower is better.


Model Card

How this hole was run

Run ID round_014_valemb_2tab
Status ok
Training Script train_gpt_valemb_2tab.py
Backend cuda
Key Overrides
TRAIN_BATCH_TOKENS=131072 · MAX_WALLCLOCK_SECONDS=600
← Back Up The Fairway: Round 1, Hole 13 · Trimming the Bag | Head To The Next Tee: Round 1, Hole 15 · Reading Every Green →