Compression score: 1.3404 BPB
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
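In symbols, as a definitional sketch (not the harness's exact bookkeeping): if the model assigns probability $p_\theta(x_i \mid x_{<i})$ to each successive byte $x_i$ of the held-out text, then

$$\mathrm{BPB} \;=\; \frac{1}{N_{\text{bytes}}} \sum_{i=1}^{N_{\text{bytes}}} -\log_2 p_\theta(x_i \mid x_{<i}),$$

so this hole's 1.3404 means the model spends roughly 1.34 bits, on average, per byte of unseen text.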
Reducing from 3 value embedding tables to 2 should save another ~260KB in artifact size, hopefully getting us under the 16MB limit.
Looper’s Pick
Hole 13 with 3 tables was 254KB over budget. Drop to 2. That should save another ~260KB. The quality held when we went from 5 to 3, so going to 2 should be safe. If it fits under 16MB, we have our submission config.
The Shot — Minimum Viable Value Embeddings
Can two shared tables carry the same signal as five?
In golf, some clubs are nearly interchangeable. Your 6-iron and 7-iron cover overlapping distances. If you had to leave one at home, you’d barely notice. But leave too many behind and gaps open up.
We’ve been systematically testing how many value embedding tables the model actually needs. Five tables (Hole 12): 1.3394 BPB. Three tables (Hole 13): 1.3414. The quality barely moved. Now we’re trying two — one table shared across layers 0-3, another across layers 4-8.
The underlying hypothesis: adjacent layers in a transformer learn similar levels of abstraction, so they want similar value embeddings. If that's true, aggressive sharing costs almost nothing. The risk is that layers now forced onto the same table sit at fairly different depths (layer 0 and layer 3, say) and genuinely need different value information.
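As a rough sketch of the sharing scheme (PyTorch-style; the shapes, names, and 9-layer split are assumptions inferred from the layer ranges quoted above, not the actual training script):

```python
import torch
import torch.nn as nn

# Two shared value-embedding tables instead of one per layer.
# Hypothetical shapes: 2048 x 128 = 262,144 params per table, matching the
# ~262K figure quoted below; the real run's factorization may differ.
vocab_size, value_dim, n_layers = 2048, 128, 9

value_tables = nn.ModuleList(
    [nn.Embedding(vocab_size, value_dim) for _ in range(2)]
)

# Layers 0-3 read table 0, layers 4-8 read table 1.
layer_to_table = [0, 0, 0, 0, 1, 1, 1, 1, 1]

def value_embedding(layer_idx: int, token_ids: torch.Tensor) -> torch.Tensor:
    """Look up the value embedding used by a given layer (how it feeds the
    attention values is left out of this sketch)."""
    return value_tables[layer_to_table[layer_idx]](token_ids)
```

The appeal of the map is that moving from three tables to two only changes `layer_to_table` and the list length; the rest of the model is untouched.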
Two tables at 262K parameters each means 524K extra params, down from 786K with three tables, so the artifact should shrink by roughly 260KB relative to the 3-table version. Whether that's enough to get under 16MB is the real question.
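The back-of-the-envelope math, assuming one byte per INT8 parameter and ignoring how zlib behaves on the serialized artifact:

```python
params_per_table = 262_144          # ~262K params per value-embedding table
tables_removed = 1                  # 3 tables -> 2 tables
bytes_per_param = 1                 # INT8 weights

saved_kb = params_per_table * tables_removed * bytes_per_param / 1024
print(f"expected shrink: ~{saved_kb:.0f} KB")                  # ~256 KB, the quoted ~260KB
print(f"remaining overhead: {2 * params_per_table:,} params")  # 524,288
```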
On the Tee
(Whispering) Down to two. Two value embedding tables where once there were five. The competitor is testing the absolute floor of this technique. One notes the caddy consulting a calculator rather more than usual today.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.3404 |
| val_loss | 2.2631 |
| params | ~17,590,000 |
| artifact | 16.20 MB (STILL over 16MB by 248KB) |
| wall time | 600s |
| steps completed | ~2,552 |
Value Embedding Table Count (full series)
| Tables | val_bpb | Artifact | Under 16MB? |
|---|---|---|---|
| 5 (Hole 12) | 1.3394 | 16.72 MB | No |
| 3 (Hole 13) | 1.3414 | 16.20 MB | No |
| 2 (Hole 14) | 1.3404 | 16.20 MB | No |
| 0 (baseline) | ~1.36* | 13.35 MB | Yes |
Quality is essentially identical across 2-5 tables. But the artifact barely budged going from 3 tables to 2: the value embedding tables themselves aren't the size bottleneck; the base model's INT8 weights are. We need better compression, not fewer tables.
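A rough accounting of where the megabytes actually sit, assuming one byte per INT8 parameter and ignoring zlib; the parameter counts come from the tables above, everything else is illustrative:

```python
MB = 1024 * 1024

total_params = 17_590_000                 # from the results table
value_embed_params = 2 * 262_144          # the two shared tables
base_params = total_params - value_embed_params

print(f"base model   @ 8 bits: {base_params / MB:5.2f} MB")         # ~16.3 MB by itself
print(f"value embeds @ 8 bits: {value_embed_params / MB:5.2f} MB")  # ~0.5 MB
# Trimming tables moves fractions of a megabyte; re-quantizing the base
# model is where whole megabytes live:
print(f"base model   @ 6 bits: {base_params * 6 / 8 / MB:5.2f} MB") # ~12.2 MB
```

Under these assumptions the drop from 8 to 6 bits per weight is worth about 4MB, which lines up with the figure Slice quotes below.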
The Booth Reacts
Trent: (Long pause) Sixteen-point-two-zero megabytes. Again. (Removes glasses, polishes them) The mathematics are rather stubborn today. Two tables or three, the artifact refuses to slip beneath sixteen million bytes. The value embeddings are not the problem. The problem is that seventeen million parameters at eight bits each, even with zlib, simply do not fit alongside their compression overhead. One suspects a different club is needed altogether. Not fewer embeddings — fewer bits per weight.
Slice: Three holes in a row we’ve been trimming value embedding tables and the artifact WILL NOT BUDGE. You know what the definition of insanity is? It’s doing the same thing and expecting — actually, you know what, I just heard something from the leaderboard. (Leans in) People are using INT6 quantization. Six bits instead of eight. That saves FOUR MEGABYTES. And sliding window eval is worth 0.03 BPB for FREE. We’ve been optimizing the wrong thing! Get the caddy back here. We need a COMPLETELY different approach.
The Card
Picked up strokes on the field
This hole improved 0.0010 on the compression score versus the previous stop. Lower is better here: the model predicts unseen text a touch more efficiently. But there is still no artifact headroom; the file remains 248KB over the 16MB budget.
Training Curve
Two tables perform the same as three or five: the value embeddings are highly shareable. But we're still 248KB over. The real fix isn't fewer tables; it's fewer bits per weight (INT6) on the size side, plus sliding window eval on the quality side (sketched below).
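For the second lead, a minimal sketch of sliding-window evaluation, assuming a causal LM `model` that maps a `(1, T)` tensor of token ids to `(1, T, vocab)` logits; the function name, window sizes, and the `n_bytes` argument are illustrative, not the competition harness:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_bpb(model, tokens: torch.Tensor, n_bytes: int,
                       ctx_len: int = 1024, stride: int = 256) -> float:
    """Bits per byte scored with overlapping windows instead of disjoint chunks.

    `tokens` is a 1-D LongTensor of the held-out stream; `n_bytes` is the raw
    byte length of the text it decodes to.
    """
    total_nll, prev_end = 0.0, 0
    for begin in range(0, tokens.numel() - 1, stride):
        end = min(begin + ctx_len, tokens.numel() - 1)
        new = end - prev_end                      # targets no earlier window scored
        inputs = tokens[begin:end].unsqueeze(0)
        targets = tokens[begin + 1 : end + 1].unsqueeze(0)
        logits = model(inputs)                    # (1, end - begin, vocab)
        # Only the trailing `new` positions are charged; they now see up to
        # ctx_len tokens of context instead of starting a chunk from scratch.
        total_nll += F.cross_entropy(
            logits[0, -new:], targets[0, -new:], reduction="sum"
        ).item()
        prev_end = end
        if end == tokens.numel() - 1:
            break
    return total_nll / math.log(2) / n_bytes      # nats -> bits, then per raw byte
```

The gain comes purely from scoring text with more context; the model itself is unchanged, which is why the booth calls the ~0.03 BPB "free".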
vs. the Field
[Chart: this hole's 1.3404 BPB alongside field scores of 1.2197, 1.2244, and 1.2244.]
Signature Voices
Post-round notebook notes from the tower, the caddie book, and the cheap seats.
(Whispering) Down to two. Two value embedding tables where once there were five. The competitor is testing the absolute floor of this technique.
Three holes in a row we've been trimming value embedding tables and the artifact WILL NOT BUDGE. People are using INT6 quantization. Six bits instead of eight. That saves FOUR MEGABYTES. We've been optimizing the wrong thing!
Model Card
How this hole was run
| Run | Status | Script | Device |
|---|---|---|---|
| round_014_valemb_2tab | ok | train_gpt_valemb_2tab.py | cuda |