Compression score: 1.3286
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
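For the curious, the score can be related to the model's training loss. A minimal sketch, assuming the usual conversion (cross-entropy loss is nats per token, so bpb = loss / ln 2 × tokens per byte); the tokens-per-byte ratio below is not measured, it is implied by this hole's reported numbers:

```python
import math

# Cross-entropy loss is nats per token; bits per byte rescales it:
#   bpb = (loss / ln 2) * (tokens / bytes)
# The tokens-per-byte ratio depends on the tokenizer. Here we recover the
# implied ratio from this hole's reported val_loss and val_bpb.
val_loss = 2.2433   # nats per token (this hole's val_loss)
val_bpb = 1.3286    # bits per byte (this hole's score)

bits_per_token = val_loss / math.log(2)
implied_tokens_per_byte = val_bpb / bits_per_token
print(f"{bits_per_token:.3f} bits/token, {implied_tokens_per_byte:.3f} tokens/byte")
```

In other words, a lower loss and a tokenizer that packs more bytes into each token both push the score down.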
INT6 quantization (63 levels instead of 255) should compress much better with zlib, finally getting the artifact under 16MB at the cost of some quality.
Looper’s Pick
Four holes in a row we’ve been over the 16MB limit. The value embeddings earn their keep but the artifact won’t cooperate. The fix isn’t fewer parameters — it’s fewer bits per parameter. INT6 quantization: 63 levels instead of 255. Every weight gets rounded to a coarser grid. That sounds bad, but the magic is in the compression: zlib LOVES low-entropy data, and 63 unique values compress way better than 255. The leaderboard leaders are all using this trick. Time we did too.
The Shot — INT6 Quantization
Why does reducing precision from 8 bits to 6 bits help so much with compression?
Imagine you’re packing a suitcase. With 255 different items (INT8), every pocket is unique — the zipper can’t find patterns to exploit. But with only 63 items (INT6), there’s far more repetition: the same few values appear over and over. A good compressor like zlib exploits exactly this kind of repetition.
Standard INT8 quantization maps each weight to one of 255 levels (-127 to +127). After zlib compression, this gives roughly 4-5x compression relative to the raw float32 tensor bytes. INT6 maps to only 63 levels (-31 to +31). The weights are still stored as regular int8 bytes (there's no native 6-bit type), but since only 63 of the 256 possible byte values ever occur, zlib's Huffman entropy coding can represent each value in fewer bits.
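This effect is easy to reproduce. A toy sketch, not the competition's actual packer: it uses synthetic Gaussian "weights", per-tensor symmetric quantization, and zlib level 9, all of which are illustrative assumptions.

```python
import zlib
import numpy as np

# Toy experiment: quantize the same Gaussian "weights" to 255 vs 63 levels,
# store both as int8 bytes, and compare how well zlib compresses each.
rng = np.random.default_rng(0)
w = rng.normal(size=1_000_000).astype(np.float32)

def quantize(w, max_level):
    # Symmetric per-tensor quantization: round onto a grid of
    # 2 * max_level + 1 integer levels, stored as int8 either way.
    scale = np.abs(w).max() / max_level
    return np.clip(np.round(w / scale), -max_level, max_level).astype(np.int8)

int8_size = len(zlib.compress(quantize(w, 127).tobytes(), 9))
int6_size = len(zlib.compress(quantize(w, 31).tobytes(), 9))
print(int8_size, int6_size)  # the 63-level tensor compresses noticeably smaller
```

Both tensors occupy one byte per weight on disk before compression; the entire saving comes from the lower-entropy byte distribution.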
The result in practice: our artifact went from 16.71 MB (INT8, over the limit) to 12.68 MB (INT6, 3.3MB under the limit). That’s a 24% reduction in compressed size.
The cost: each weight has less precision. Instead of 255 representable values per quantization scale, we have 63, which introduces more rounding error during the quantization step. Our BPB went from 1.3055 (INT8 + sliding window) to 1.3286, a 0.0231 BPB degradation.
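The precision cost is also easy to quantify. A minimal sketch under the same synthetic-Gaussian-weights assumption as above: since the INT6 grid step is about 4x coarser (127/31), the round-trip RMS error should be about 4x larger.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100_000).astype(np.float32)

def rms_roundtrip_error(w, max_level):
    # Quantize, dequantize, and measure the RMS rounding error introduced.
    scale = np.abs(w).max() / max_level
    q = np.clip(np.round(w / scale), -max_level, max_level)
    return float(np.sqrt(np.mean((q * scale - w) ** 2)))

err8 = rms_roundtrip_error(w, 127)  # INT8: 255 levels
err6 = rms_roundtrip_error(w, 31)   # INT6: 63 levels
print(err6 / err8)  # roughly 4x, matching the 127/31 step-size ratio
```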
But here’s the key strategic insight: we now have 3.3MB of headroom. That’s enough for ~3 million additional parameters at INT6 compression rates. The leaderboard leaders use INT6 specifically to unlock bigger models — like 3x MLP width — that more than compensate for the per-weight precision loss. We took a small step back in quality to take a large step forward in capacity.
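A back-of-envelope check on that headroom claim, using only this hole's own reported numbers: dividing artifact size by parameter count gives the average compressed cost per INT6 weight, and dividing the headroom by that rate gives an upper bound on the extra parameters it could buy. The "~3 million" in the text is the conservative end of this range, since new tensors won't all compress at the artifact's average rate.

```python
# All inputs are this hole's reported figures.
artifact_bytes = int(12.68 * 1024 * 1024)  # 12.68 MB artifact
params = 18_380_000                        # ~18.38M parameters
headroom_bytes = 3_375_392                 # reported headroom under 16 MB

bytes_per_param = artifact_bytes / params          # avg compressed cost per weight
extra_params = headroom_bytes / bytes_per_param    # upper bound on added params
print(f"{bytes_per_param:.2f} B/param, ~{extra_params / 1e6:.1f}M extra params")
```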
On the Tee
(Whispering) And finally — finally — the competitor addresses the elephant that has been standing patiently on the fairway for four consecutive holes. The artifact size. INT6 quantization. Sixty-three levels where there were two hundred and fifty-five. The bag gets lighter. The question is how much skill goes with it.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.3286 |
| val_loss | 2.2433 |
| params | ~18,380,000 |
| artifact | 12.68 MB (3.3MB under 16MB!) |
| wall time | 600s |
| steps completed | ~2,541 |
INT8 vs INT6
| Quant | val_bpb | Artifact | Under 16MB? | Headroom |
|---|---|---|---|---|
| INT8 (Hole 15) | 1.3055 | 16.71 MB | No | -710 KB |
| INT6 (Hole 16) | 1.3286 | 12.68 MB | Yes | +3.32 MB |
Lost 0.023 BPB but gained 3.3MB of headroom. This is the enabling technique for everything that follows.
The Booth Reacts
Trent: (Visible relief) Twelve-point-six-eight megabytes. Ladies and gentlemen, after four holes of anguished arithmetic, the artifact is finally — finally — beneath the sixteen-megabyte ceiling. And not by a whisker, mind you. By three-point-three megabytes. (Adjusts tie) Yes, the BPB has risen by twenty-three thousandths versus INT8. But one now has room. Room for wider layers, deeper architectures, additional parameters. This is not a retreat. This is building the runway for the final approach.
Slice: TWELVE POINT SIX EIGHT! We went from 16.7 — OVER the line, DQ’d, go home, thanks for playing — to 12.7 with room to SPARE! That’s not a compression trick, that’s a MAGIC trick! And yeah, we gave back 0.023 BPB. You know what 0.023 BPB buys you? NOTHING compared to what 3.3 megabytes of headroom buys you. We can put three MILLION more parameters in this thing now. The leaderboard leaders? They run 3x MLP width. You know why? Because INT6 gives them the ROOM. We’re finally playing the same game they’re playing. (Slams table) Now. Let’s USE that room.
The Card
Dropped a shot versus the last hole
This hole gave back 0.0231 on the compression score versus the previous stop. Lower is better here, so that is a genuine quality cost, accepted in exchange for 3,375,392 bytes of artifact headroom.
Training Curve
The artifact problem is solved. 12.68MB with 3.3MB of headroom. INT6 cost 0.023 BPB but unlocked legality AND room for a bigger model. This is the enabling technique for everything that follows.
vs. the Field
Field scores: 1.2197, 1.2244, 1.2244. This hole: 1.3286.
Model Card
How this hole was run
| Run | Status | Script | Device |
|---|---|---|---|
| round_016_int6 | ok | train_gpt_valemb_sw_int6.py | cuda |