Compression score: 1.3414 bits per byte
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
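For the curious, here is one common way to turn per-token cross-entropy into bits per byte. This is a sketch, not the competition's official scorer, and the bytes-per-token figure below is only what this hole's card implies.

```python
import math

# Sketch of the usual cross-entropy -> bits-per-byte conversion.
# (Assumption: the challenge scores this way; the official scorer may differ.)
def bits_per_byte(ce_nats_per_token: float, bytes_per_token: float) -> float:
    bits_per_token = ce_nats_per_token / math.log(2)  # nats -> bits
    return bits_per_token / bytes_per_token

# Consistency check against this hole's card: val_loss 2.2649 nats/token and
# val_bpb 1.3414 together imply roughly 2.44 bytes per token on the eval set.
print(2.2649 / (math.log(2) * 1.3414))  # ~2.436
```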
Reducing from 5 value embedding tables to 3 should shrink the artifact enough to fit under 16MB while preserving most of the quality.
Looper’s Pick
Hole 12 found gold but the suitcase was too big. The artifact hit 16.7MB — 700KB over the 16MB limit. The value embedding tables are the culprit: five tables at 262K params each. Let’s try three tables instead, sharing across layer triplets instead of pairs. That cuts ~500KB from the artifact. If the quality holds, we’re close to legal.
The Shot — Fewer Value Embedding Tables
How much sharing can value embeddings tolerate?
In golf, you can share a caddy between two players in a casual round. Three players sharing one caddy is a stretch — the advice gets thinner, the reads get slower. But a great caddy can still help three players better than no caddy at all.
Value embeddings face the same sharing trade-off. In Hole 10, we used 5 tables shared across layer pairs (layers 0-1, 2-3, 4-5, 6-7, 8). Each pair got its own dedicated value embedding. Now we’re trying 3 tables shared across triplets (layers 0-2, 3-5, 6-8). Each table serves more layers, which means the embeddings can’t specialize as much for each layer’s specific needs.
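A minimal sketch of that mapping, assuming the sharing is plain floor-division over the layer index (the actual wiring in train_gpt_valemb_slim.py may differ; names here are illustrative):

```python
NUM_LAYERS = 9

def table_for_layer(layer: int, group_size: int) -> int:
    """Map a transformer layer to the value-embedding table it shares."""
    return layer // group_size

# Hole 10: 5 tables shared across pairs (0-1, 2-3, 4-5, 6-7, 8 alone).
assert [table_for_layer(l, 2) for l in range(NUM_LAYERS)] == [0, 0, 1, 1, 2, 2, 3, 3, 4]

# Hole 13: 3 tables shared across triplets (0-2, 3-5, 6-8).
assert [table_for_layer(l, 3) for l in range(NUM_LAYERS)] == [0, 0, 0, 1, 1, 1, 2, 2, 2]
```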
The key question: were the 5 tables actually specializing, or were some of them learning redundant information? If adjacent layers want similar value embeddings anyway (which is plausible — nearby layers in a transformer tend to capture similar levels of abstraction), then sharing across triplets costs very little quality while saving ~500KB in the compressed artifact.
The savings come from having 3 × 262,144 = 786K params instead of 5 × 262,144 = 1.3M params. At INT8 + zlib, that’s roughly 500KB of compressed artifact size — enough to potentially bring us under the 16MB competition limit.
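To keep that arithmetic honest, here is a back-of-the-envelope check plus a toy INT8 + zlib packer. Only the parameter counts come from the post; the table contents and the compression behavior of random data are stand-ins for the real checkpoint.

```python
import zlib
import numpy as np

# From the post: 262,144 params per value-embedding table.
PARAMS_PER_TABLE = 262_144
saved_params = (5 - 3) * PARAMS_PER_TABLE                 # 524,288 params
print(f"raw INT8 savings: {saved_params / 1024:.0f} KB")  # 512 KB before zlib

# Toy INT8 + zlib packing for a single table. Real weights compress a little
# under zlib, which is why the post estimates ~500KB rather than the raw 512KB.
rng = np.random.default_rng(0)
table = rng.standard_normal(PARAMS_PER_TABLE).astype(np.float32)
scale = np.abs(table).max() / 127.0                 # symmetric per-tensor scale
q = np.round(table / scale).astype(np.int8)         # one byte per parameter
blob = zlib.compress(q.tobytes(), level=9)
print(f"one table: {q.nbytes / 1024:.0f} KB raw, {len(blob) / 1024:.0f} KB zlib")
```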
On the Tee
(Whispering) The competitor returns to the tee with a lighter bag. Three value embedding tables where there were five. The question is not whether the quality will hold — one rather suspects it will — but whether the arithmetic of compression will finally cooperate.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.3414 |
| val_loss | 2.2649 |
| params | ~17,850,000 |
| artifact | 16.20 MB (STILL over 16MB by 254KB) |
| wall time | 600s |
| steps completed | 2,552 |
| step avg | 235ms |
Value Embedding Table Count Comparison (10-min runs)
| Tables | val_bpb | Artifact | Under 16MB? |
|---|---|---|---|
| 5 (Hole 12) | 1.3394 | 16.72 MB | No (720KB over) |
| 3 (Hole 13) | 1.3414 | 16.20 MB | No (254KB over) |
| 2 (next) | ??? | ~15.9 MB? | Hopefully |
Quality essentially identical (0.002 BPB difference is noise). But still over budget. Need one more trim.
The Booth Reacts
Trent: One-point-three-four-one-four. (Nods approvingly) Virtually indistinguishable from the five-table version. The two discarded tables were, as one suspected, largely ornamental. However. (Adjusts glasses) Sixteen-point-two megabytes. Still over the line. Two hundred and fifty-four kilobytes over, to be precise. One more trim, one imagines, and we shall finally be within the ropes.
Slice: Two tables doing NOTHING and we were carrying them around like dead weight! Classic over-packing. But look — we’re SO close. 254KB. That’s like being 254 yards from the green on a par 5. One more good shot and we’re on the dance floor. Drop to two tables. If the quality holds again — and I bet it does — we’ve got a legal artifact AND a 1.34 BPB. That’s a card I’d sign.
The Card
Dropped a shot versus the last hole
This hole lost 0.0020 on the compression score versus the previous stop. Lower is better here, so that's a small step backward in predictive efficiency, and the artifact still has no headroom: it remains 254KB over the 16MB budget.
Training Curve
3 tables perform the same as 5 — the extra two weren't pulling their weight. But still 254KB over budget. Need to trim further.
vs. the Field
[Field chart: this hole's 1.3414 bpb against field scores of 1.2197, 1.2244, and 1.2244]
Model Card
How this hole was run
| Run | Status | Script | Device |
|---|---|---|---|
| round_013_valemb_slim | ok | train_gpt_valemb_slim.py | cuda |