Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

← All Holes
Round 1, Hole 16 · Bogey · March 19, 2026

1.3286

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.1042 · vs last hole: +0.0231
Tee Box: R1 · H16
Artifact: 12.62 MB
Headroom: 3.38 MB (room left under 16 MB)
Tempo: 236 ms/step · 2,541 steps
Looper · The Caddie
Calculated Risk

Four holes in a row we've been over the 16MB limit. The value embeddings earn their keep but the artifact won't cooperate. The fix isn't fewer parameters — it's fewer bits per parameter. INT6 quantization: 63 levels instead of 255. Every weight gets rounded to a coarser grid. That sounds bad, but the magic is in the compression: zlib LOVES low-entropy data, and 63 unique values compress way better than 255. The leaderboard leaders are all using this trick. Time we did too.
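The caddie's compression claim is easy to poke at directly. A minimal sketch, assuming illustrative Gaussian weights (this is not the contest code; shapes and distribution are assumptions): quantize the same tensor to 255 levels and to 63, then ask zlib to pack each.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=1_000_000).astype(np.float32)

def quantized_bytes(w, max_level):
    # Symmetric round-to-nearest quantization into [-max_level, +max_level].
    scale = np.abs(w).max() / max_level
    q = np.clip(np.round(w / scale), -max_level, max_level).astype(np.int8)
    return q.tobytes()

for name, max_level in [("INT8 (255 levels)", 127), ("INT6 (63 levels)", 31)]:
    raw = quantized_bytes(weights, max_level)
    packed = zlib.compress(raw, 9)
    print(f"{name}: {len(raw):,} B raw -> {len(packed):,} B compressed "
          f"({len(raw) / len(packed):.2f}x)")
```

The raw byte counts are identical in both cases; only the number of distinct byte values changes, and zlib's output shrinks accordingly.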

Technical Read

INT6 quantization (63 levels instead of 255) should compress much better with zlib, finally getting the artifact under 16MB at the cost of some quality.

Trent Fairway · On the Tee

(Whispering) And finally — finally — the competitor addresses the elephant that has been standing patiently on the fairway for four consecutive holes. The artifact size. INT6 quantization. Sixty-three levels where there were two hundred and fifty-five. The bag gets lighter. The question is how much skill goes with it.


The Shot — INT6 Quantization

Why does reducing precision from 8 bits to 6 bits help so much with compression?

Imagine you’re packing a suitcase. With 255 different items (INT8), every pocket is unique — the zipper can’t find patterns to exploit. But with only 63 items (INT6), there’s far more repetition: the same few values appear over and over. A good compressor like zlib exploits exactly this kind of repetition.

Standard INT8 quantization maps each weight to one of 255 levels (-127 to +127). After zlib compression, this gives roughly 4-5x compression from the raw float32 tensor bytes. INT6 maps to only 63 levels (-31 to +31). The weights are stored as regular int8 bytes (there’s no native 6-bit type), but since only 63 of the 256 possible byte values are used, zlib’s entropy coding can represent each value in fewer bits.
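Here's a sketch of that storage scheme, assuming a single symmetric scale per tensor (the actual script may scale per-channel or per-row): the 6-bit codes ride inside ordinary int8 bytes, and at most 63 of the 256 possible byte values ever appear.

```python
import numpy as np

def quantize_int6(w):
    # 63 levels: codes -31 .. +31, one shared scale for the whole tensor.
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def dequantize_int6(q, scale):
    # What the evaluator reconstructs after decompression.
    return q.astype(np.float32) * scale

w = np.random.default_rng(1).normal(0.0, 0.02, size=10_000).astype(np.float32)
q, scale = quantize_int6(w)
w_hat = dequantize_int6(q, scale)

print("distinct byte values used:", np.unique(q).size)        # at most 63 of 256
print("max rounding error:", float(np.abs(w - w_hat).max()))  # ~= scale / 2
```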

The result in practice: our artifact went from 16.71 MB (INT8, over the limit) to 12.68 MB (INT6, 3.3MB under the limit). That’s a 24% reduction in compressed size.

The cost: each weight has less precision. Instead of 255 possible values per scale factor, we have 63. This introduces more rounding error during the quantization step. Our BPB went from 1.3055 (INT8 + sliding window) to 1.3286 — a 0.023 BPB degradation.
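The back-of-envelope behind that rounding error, using the standard step²/12 variance of uniform rounding noise (max_w below is a placeholder, not a measured value):

```python
# Uniform rounding noise has variance step**2 / 12, so error tracks step size.
max_w = 1.0              # placeholder for max|w| of a given tensor
step_int8 = max_w / 127  # 255 levels
step_int6 = max_w / 31   # 63 levels
print(f"step size ratio: {step_int6 / step_int8:.2f}x")              # ~4.10
print(f"error variance ratio: {(step_int6 / step_int8) ** 2:.1f}x")  # ~16.8
```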

But here’s the key strategic insight: we now have 3.3MB of headroom. That’s enough for ~3 million additional parameters at INT6 compression rates. The leaderboard leaders use INT6 specifically to unlock bigger models — like 3x MLP width — that more than compensate for the per-weight precision loss. We took a small step back in quality to take a large step forward in capacity.
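A rough sanity check on the headroom arithmetic, using this hole's own numbers. The average below includes code bytes, so it slightly overstates the marginal cost per parameter; the ~3 million figure quoted above reads as the conservative end of this range, leaving margin for less-compressible tensors.

```python
params = 18_380_000
artifact_bytes = 12_680_000  # 12.68 MB (decimal), from the table below
headroom_bytes = 3_320_000   # +3.32 MB of headroom

bytes_per_param = artifact_bytes / params  # includes code bytes, so slightly high
print(f"compressed bytes/param: {bytes_per_param:.3f}")                   # ~0.69
print(f"extra params afforded:  {headroom_bytes / bytes_per_param:,.0f}") # ~4.8M
```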


Results

Metric            Value
val_bpb           1.3286
val_loss          2.2433
params            ~18,380,000
artifact          12.68 MB (3.3MB under 16MB!)
wall time         600s
steps completed   ~2,541

INT8 vs INT6

Quant            val_bpb   Artifact   Under 16MB?   Headroom
INT8 (Hole 15)   1.3055    16.71 MB   No            -710 KB
INT6 (Hole 16)   1.3286    12.68 MB   Yes           +3.32 MB

Lost 0.023 BPB but gained 3.3MB of headroom. This is the enabling technique for everything that follows.

The Booth Reacts

Trent: (Visible relief) Twelve-point-six-eight megabytes. Ladies and gentlemen, after four holes of anguished arithmetic, the artifact is finally — finally — beneath the sixteen-megabyte ceiling. And not by a whisker, mind you. By three-point-three megabytes. (Adjusts tie) Yes, the BPB has risen by twenty-three thousandths versus INT8. But one now has room. Room for wider layers, deeper architectures, additional parameters. This is not a retreat. This is building the runway for the final approach.

Slice: TWELVE POINT SIX EIGHT! We went from 16.7 — OVER the line, DQ’d, go home, thanks for playing — to 12.7 with room to SPARE! That’s not a compression trick, that’s a MAGIC trick! And yeah, we gave back 0.023 BPB. You know what 0.023 BPB buys you? NOTHING compared to what 3.3 megabytes of headroom buys you. We can put three MILLION more parameters in this thing now. The leaderboard leaders? They run 3x MLP width. You know why? Because INT6 gives them the ROOM. We’re finally playing the same game they’re playing. (Slams table) Now. Let’s USE that room.


The Card

Scorecard
Result: Free Lunch

Dropped a shot versus the last hole

This hole gave up 0.0231 on the compression score versus the previous stop (lower is better, so that's a step backward in prediction quality). The trade: the artifact now leaves 3,375,392 bytes of headroom under the 16MB cap.

1.3286 (+0.1042 vs baseline) · val bpb · Bits per byte, the headline score
How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

2.2433 · val loss · Validation cross-entropy loss
The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits-per-byte. Related to BPB by: BPB = (val_loss / ln 2) × (tokens / bytes). Lower is better.

18,380,000 · params · Total trainable parameters
The number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16MB limit constrains how many parameters you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

12.62 MB · artifact · Compressed model + code size
The total submission size: your training script's code bytes plus the model weights compressed via quantization (INT6 on this hole) and zlib. Must be under 16,000,000 bytes (decimal 16MB). The model is decompressed and dequantized before evaluation.

600s · wall time · Training wall clock time
Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.

2,541 · steps · Training steps completed
Each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall clock limit.

236ms · step avg · Average time per training step
How long each gradient update takes in milliseconds. Faster steps mean more training in the same wall clock budget. Affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
[Artifact gauge: 12.62 MB of the 16 MB limit]
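The BPB relation quoted on the card can be checked directly. The tokens-per-byte ratio below is inferred from this hole's numbers, not a quantity reported on the card:

```python
import math

val_loss = 2.2433  # nats per token
val_bpb = 1.3286   # bits per byte

bits_per_token = val_loss / math.log(2)     # ~3.2365
tokens_per_byte = val_bpb / bits_per_token  # solve BPB = bits/token * tokens/byte
print(f"bits per token: {bits_per_token:.4f}")
print(f"implied bytes per token: {1 / tokens_per_byte:.2f}")  # ~2.44
```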

Training Curve

[Training curve: train_loss vs. step, this hole vs. Baseline (R2); final train_loss 2.3842]
Post-Round Lesson

The artifact problem is solved. 12.68MB with 3.3MB of headroom. INT6 cost 0.023 BPB but unlocked legality AND room for a bigger model. This is the enabling technique for everything that follows.

vs. the Field

+0.1089 vs SOTA (1.2197)
+0.1042 vs Baseline (1.2244)
+0.1042 vs Our Best (1.2244)
[Bar chart: SOTA 1.2197 · Baseline 1.2244 · Our Best 1.2244 · This Hole 1.3286; lower is better]


Model Card

How this hole was run

Run ID round_016_int6
Status ok
Training Script train_gpt_valemb_sw_int6.py
Backend cuda
Key Overrides
TRAIN_BATCH_TOKENS=131072 · VAL_SW_STRIDE=256 · QUANT_BITS=6 · MAX_WALLCLOCK_SECONDS=600
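For anyone wondering how overrides like these reach the run: a hedged sketch assuming they arrive as plain environment variables. The variable names mirror the card; the actual plumbing inside train_gpt_valemb_sw_int6.py isn't shown here.

```python
import os

# Hypothetical read-side of the Key Overrides; defaults are the values above.
TRAIN_BATCH_TOKENS = int(os.environ.get("TRAIN_BATCH_TOKENS", "131072"))
VAL_SW_STRIDE = int(os.environ.get("VAL_SW_STRIDE", "256"))
QUANT_BITS = int(os.environ.get("QUANT_BITS", "6"))
MAX_WALLCLOCK_SECONDS = int(os.environ.get("MAX_WALLCLOCK_SECONDS", "600"))

print(TRAIN_BATCH_TOKENS, VAL_SW_STRIDE, QUANT_BITS, MAX_WALLCLOCK_SECONDS)
```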
← Back Up The Fairway: Round 1, Hole 15 · Reading Every Green
Head To The Next Tee: Round 1, Hole 17 · The Big Iron →