Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

Round 1, Hole 17 · Bogey · March 19, 2026

1.3352

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.1108 · vs last hole: +0.0066
Tee Box: R1 · H17
Artifact: 15.40 MB
Headroom: 0.60 MB (room left under the 16 MB limit)
Tempo: 264 ms/step · 2,273 steps

Technical Read

INT6 freed 3.3MB of headroom. Use it for 3x MLP width (1536 hidden), adding ~5M params of pure feed-forward capacity.


Looper’s Pick

We’ve got 3.3MB of headroom from INT6. Time to spend it. The leaderboard leaders run MLP_MULT=3 — 1536 hidden instead of 1024. That’s 50% more feed-forward capacity per layer, adding ~5M parameters (from 18.4M to ~23M). In Hole 8 we learned that cutting the MLP hurts badly. Now we test the opposite: what happens when we make it bigger? INT6 compression means the artifact should still fit under 16MB.
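
Quick yardage-book math on where that ~5M comes from, under stated assumptions: a standard two-matrix MLP (up- and down-projection, biases ignored) and roughly nine transformer layers, neither of which is spelled out on this card.

# Rough MLP parameter delta from MLP_MULT=2 -> MLP_MULT=3.
# Assumes a two-matrix MLP and ~9 layers (both assumptions, not stated here).
d_model = 512
hidden_old, hidden_new = 2 * d_model, 3 * d_model
delta_per_layer = 2 * d_model * (hidden_new - hidden_old)   # 524,288 params per layer
layers = 9
print(f"~{layers * delta_per_layer / 1e6:.1f}M extra params")  # ~4.7M, matching 18.4M -> ~23M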

The Shot — Wider MLP with INT6 Headroom

Why does a wider MLP help, and how does INT6 make it affordable?

In golf, upgrading from a standard driver to an oversized one gives you a bigger sweet spot. You don’t swing harder — you just connect better more often. The MLP is the transformer’s “sweet spot” for storing knowledge, and making it wider is like making the club face bigger.

The MLP’s hidden dimension controls how much processing happens between attention steps. At MLP_MULT=2 (baseline), each 512-dimensional token representation expands to 1024 dimensions in the hidden layer — essentially giving the model a 1024-dimensional “scratch space” to think in before projecting back to 512. At MLP_MULT=3, that scratch space grows to 1536 dimensions, providing 50% more room for the model to decompose, transform, and recombine features.
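
As a sketch of what that looks like in code (illustrative, not the actual training script: the GELU activation and module layout are assumptions; D_MODEL=512 and MLP_MULT come from the run config):

import torch.nn as nn

D_MODEL = 512    # token representation width (from the run config)
MLP_MULT = 3     # this hole's override; the baseline uses 2

class MLP(nn.Module):
    # Expand to MLP_MULT * D_MODEL of "scratch space", transform, project back.
    def __init__(self, d_model=D_MODEL, mult=MLP_MULT):
        super().__init__()
        hidden = mult * d_model                  # 1536 here vs 1024 at baseline
        self.up = nn.Linear(d_model, hidden)     # 512 -> 1536
        self.act = nn.GELU()
        self.down = nn.Linear(hidden, d_model)   # 1536 -> 512

    def forward(self, x):
        return self.down(self.act(self.up(x)))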

This directly addresses one of the scaling laws: at fixed architecture shape, more parameters generally means lower loss, following a power law L(N) ~ N^(-0.076). Adding ~5M parameters should buy us a meaningful reduction in loss, assuming we have enough training steps for them to converge.
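
Plugging both parameter counts into that power law gives a rough estimate of the gain on offer, assuming the quoted exponent transfers to this setup:

# Expected loss ratio from L(N) ~ N^(-0.076), using the counts quoted above.
# Actual gains depend on data and step budget; this is an upper-bound sketch.
n_old, n_new = 18.4e6, 23.1e6
ratio = (n_new / n_old) ** -0.076
print(f"L(23.1M) / L(18.4M) ~ {ratio:.4f}")  # ~0.983, i.e. roughly a 1.7% loss reduction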

The catch in previous holes was that more parameters meant a larger artifact. At INT8, 23M params compressed to ~16.7MB — over the limit. But INT6 quantization changes the math: 23M params at INT6 compress to ~15.4MB, comfortably under the 16MB ceiling. This is exactly the play the top leaderboard submissions use: aggressive quantization enabling bigger models.
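
The payload math behind that claim, before zlib takes its cut (ideal bit-packing assumed):

# Raw weight payload per quantization width, in decimal MB to match
# the 16,000,000-byte limit. zlib then shaves off the rest.
params = 23_100_000
for bits in (8, 6):
    print(f"INT{bits}: {params * bits / 8 / 1e6:.2f} MB raw")
# INT8: 23.10 MB raw -> ~16.7 MB after zlib (over the limit)
# INT6: 17.33 MB raw -> ~15.4 MB after zlib (fits)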

The trade-off is step speed. Each MLP operation is 50% larger, so step time increases from 236ms to 264ms — meaning we get 2,273 steps in 10 minutes instead of 2,541. The question is whether the extra capacity per step more than compensates for having 268 fewer steps.
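
The step budget falls straight out of the wall clock; the step times are the measured averages from this hole and the previous setup:

# Steps that fit in the 600 s budget at each measured average step time.
budget_s = 600
for label, step_ms in (("MLP_MULT=2", 236), ("MLP_MULT=3", 264)):
    print(f"{label}: ~{budget_s * 1000 / step_ms:.0f} steps")
# ~2542 vs ~2273; the run logs show 2,541 vs 2,273, i.e. 268 fewer steps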

On the Tee

Trent: (Whispering) The competitor has reached for the biggest club in the bag. Three times the MLP width. Twenty-three million parameters where once there were eighteen. And yet — thanks to the INT6 compression from Hole 16 — the artifact fits in fifteen-point-five megabytes. One notes this is precisely the strategy employed by the current leaderboard leaders. The caddie, it seems, has been reading the field guides.

Results

Metric                      Value
val_bpb (pre-quant)         1.3352
val_bpb (post-roundtrip)    (pod crashed during eval)
val_loss                    2.2544
params                      ~23,100,000
artifact                    15.46 MB (under 16MB!)
wall time                   600s
steps completed             2,273
step avg                    264ms

The Full Stack So Far

Hole   Techniques              BPB       Artifact
5      Small batch only        1.4139    13.35 MB
10     + Value embeddings      1.4057    14.48 MB
12     + 10-min run            1.3394    16.72 MB (over!)
15     + Sliding window eval   1.3055    16.71 MB (over!)
16     + INT6 quantization     1.3286    12.68 MB
17     + 3x MLP width          1.3352*   15.46 MB

*Pre-quantization number. The L40S pod died during sliding window eval before the post-roundtrip BPB was computed. We’re re-running this on H100 as Hole 18.

The Booth Reacts

Trent: (Studying the incomplete scorecard) One-point-three-three-five-two before the roundtrip — and then, silence. The computing apparatus expired mid-examination. (Long pause) It is rather like a golfer collapsing on the eighteenth green with a career-best round in hand but no signed scorecard. The number is tantalizing. The artifact fits. But we lack the official stamp. (Looks toward the H100 pod) One understands a rather more powerful machine is being prepared for the final hole.

Slice: The pod DIED?! We had the best architecture we’ve EVER built — 23 million parameters, INT6 compression, value embeddings, the works — and the COMPUTER GIVES UP during the eval?! (Deep breath) OK. OK fine. The pre-quant number is 1.3352. That’s BETTER than Hole 12’s pre-quant (1.3394). The artifact is 15.46MB — LEGAL. The only thing missing is the final roundtrip score with sliding window. We need the H100. We need it NOW. This is the approach. This is the club. We just need a caddy that doesn’t pass out on the back nine.


The Card

Result: Encouraging Miss

Dropped a shot versus the last hole

This hole lost 0.0066 on the compression score versus the previous stop. Lower is better here, so that's a step back, though the artifact still leaves 595,919 bytes of headroom under the 16 MB limit.

val bpb: 1.3352 (+0.1108). Bits per byte — the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

val loss: 2.2544. Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits-per-byte. Related to BPB by: BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.

params: 23,100,000. Total trainable parameters. The number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16MB limit constrains how many parameters you can afford — at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

artifact: 15.40 MB. Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via quantization (INT6 on this hole) and zlib. Must be under 16,000,000 bytes (decimal 16MB). The model is decompressed and dequantized before evaluation.

wall time: 600s. Training wall clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.

steps: 2,273. Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall clock limit.

step avg: 264ms. Average time per training step. How long each gradient update takes in milliseconds. Faster steps mean more training in the same wall clock budget. Affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
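
A quick sanity check of that BPB relation against this card's numbers; the tokens-to-bytes ratio isn't reported, so this inverts the formula to see what it implies:

import math

# Invert BPB = (val_loss / ln 2) * (tokens / bytes) using this card's numbers.
val_loss = 2.2544   # nats per token
bpb = 1.3352        # reported bits per byte
tokens_per_byte = bpb / (val_loss / math.log(2))
print(f"implied tokens/byte: {tokens_per_byte:.3f}")  # ~0.41, i.e. ~2.4 bytes per token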

Training Curve

[Training curve: train_loss vs. step, this hole vs. Baseline (R2); this hole finishes at train_loss 2.0968]
Post-Round Lesson

Pre-quant BPB of 1.3352 beats Hole 12's 1.3394, our previous best pre-quant number — the bigger MLP is working. The 15.46MB artifact fits under the 16MB limit. The pod died during the sliding window eval before we got the final roundtrip number, so we're re-running on H100 as Hole 18.

vs. the Field

+0.1155 vs SOTA (1.2197)
+0.1108 vs Baseline (1.2244)
+0.1108 vs Our Best (1.2244)
SOTA        1.2197
Baseline    1.2244
Our Best    1.2244
This Hole   1.3352
(lower is better)


Model Card

How this hole was run

Run ID round_017_mlp3
Status partial
Training Script train_gpt_valemb_sw_int6.py
Backend cuda
Key Overrides
MLP_MULT=3 · TRAIN_BATCH_TOKENS=131072 · VAL_SW_STRIDE=256 · QUANT_BITS=6 · MAX_WALLCLOCK_SECONDS=600
Back Up The Fairway: Round 1, Hole 16 (Fitting Through the Door) · Head To The Next Tee: Round 1, Hole 18 (The Signature Hole)