Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

← All Holes
Round 1, Hole 15 · Bogey · March 19, 2026

1.3055

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.0811 vs last hole: -0.0349
Tee Box: R1 · H15
Artifact: 16.71 MB
Headroom: 0.00 MB (room left under the 16 MB cap)
Tempo: 236 ms/step · 2,541 steps

Technical Read

Sliding window evaluation should improve BPB for free by giving every token near-maximum context, without changing training or artifact size.


Looper’s Pick

The leaderboard says everyone at the top uses sliding window eval. Instead of evaluating each token with whatever context it happens to have (tokens early in a chunk get almost none), we overlap the windows so every token gets at least 768 tokens of prior context. The training doesn’t change. The artifact doesn’t change. It’s purely an eval-time trick — and it’s reportedly worth about 0.03 BPB for free.

The Shot — Sliding Window Evaluation

What is sliding window evaluation and why is it a free improvement?

Imagine you’re reading a novel but you can only see one page at a time. At the top of each page, you’re disoriented — who was speaking? What was the context? By the bottom of the page, you’ve reoriented and your predictions are much better. Now imagine if you could overlap the pages: for each new page, you re-read the last three-quarters of the previous page first. You’re never disoriented. Every sentence gets nearly the full page of context.

That’s sliding window evaluation. In the standard approach, we split the validation text into non-overlapping chunks of 1,024 tokens. Token #0 in each chunk has zero context — the model is guessing blind. Token #512 has 512 tokens of context. Only the final token, #1023, gets nearly the full chunk behind it. On average, each token gets about 512 tokens of context.

With sliding window (stride=256), we advance by only 256 tokens between evaluation windows. Each new window of 1,024 tokens overlaps with 768 tokens from the previous window. We only score the last 256 tokens — the ones that get at least 768 tokens of context. This means every scored token has near-maximum context, dramatically improving prediction quality.
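The mechanics above can be sketched in a few lines. This is an illustrative stand-in, not the competition's eval code: `nll_fn` is a hypothetical callable that runs one forward pass and returns one negative log-likelihood (in nats) per predicted token.

```python
import math

def sliding_windows(n_tokens, window=1024, stride=256):
    """Yield (start, end, n_scored) evaluation windows.

    Each window covers tokens [start, end); only its last n_scored targets
    are scored, so every scored token outside the first window sees at
    least window - stride tokens of prior context.
    """
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + window, n_tokens)
        yield start, end, end - prev_end
        prev_end = end
        if end == n_tokens:
            break

def eval_bpb(nll_fn, tokens, n_bytes, window=1024, stride=256):
    """Bits per byte under sliding-window evaluation.

    nll_fn(window_tokens) stands in for a forward pass: it returns
    len(window_tokens) - 1 per-target losses in nats.
    """
    total_nats = 0.0
    for start, end, n_scored in sliding_windows(len(tokens), window, stride):
        nll = nll_fn(tokens[start:end])
        # Score only the tail of the window; the first window scores
        # everything (token #0 is never a prediction target).
        total_nats += sum(nll[-n_scored:])
    return total_nats / math.log(2) / n_bytes
```

With stride equal to the window size this reduces to the standard chunked evaluation; shrinking the stride only changes which forward passes are run, never the model.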

The trick: this changes nothing about the trained model or the compressed artifact. The model is identical. The only difference is how we evaluate it. The competition explicitly allows this — evaluation can use any sequence length and any strategy within the 10-minute eval budget. Every top submission on the leaderboard uses some form of sliding window.

The trade-off is eval speed: with stride=256 on 62M validation tokens, we need ~240K forward passes instead of ~60K. On our L40S this took 13.5 minutes (over the 10-minute eval budget), but on 8xH100 it would be much faster. Batching multiple windows together and using a smaller stride (like 64) would further improve BPB at the cost of more compute.
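The forward-pass arithmetic is straightforward (taking the 62M validation tokens from the text at face value):

```python
val_tokens = 62_000_000

# Standard eval: one forward pass per non-overlapping 1,024-token chunk.
standard_passes = val_tokens // 1024   # ~60K

# Sliding window: one forward pass per 256-token stride step.
sliding_passes = val_tokens // 256     # ~242K, roughly 4x the eval compute

print(f"{standard_passes:,} vs {sliding_passes:,}")
```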

On the Tee

(Whispering) A quiet revolution today. The competitor has not changed a single weight, not modified a single layer. The model is precisely the same as Hole 12. What has changed is how we look at it. Sliding window evaluation. Every token, every prediction, given the fullest possible context. It’s rather like discovering that the course you’ve been playing has been measured in kilometers rather than miles. The scores… are about to change.

Results

val_bpb: 1.3055
val_loss: 2.2043
params: ~18,380,000
artifact: 16.71 MB (still over 16 MB)
wall time: 600s (training) + 814s (eval)
eval stride: 256 tokens
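As a sanity check, the val_loss and val_bpb rows are consistent with the conversion BPB = (val_loss / ln 2) × (tokens / bytes), which implies the tokenizer averages roughly 2.44 bytes per token on the validation set:

```python
import math

val_loss = 2.2043                           # cross-entropy, nats per token
val_bpb = 1.3055                            # bits per byte

bits_per_token = val_loss / math.log(2)     # ~3.18 bits per token
tokens_per_byte = val_bpb / bits_per_token  # ~0.41
bytes_per_token = 1 / tokens_per_byte       # ~2.44
```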

Sliding Window Effect

Standard (Hole 12): 1.3394
Sliding window, stride=256 (this hole): 1.3055 (delta -0.0339)

Free. Zero cost to training. Zero cost to artifact. Pure eval-time improvement.

Remaining Issues

The artifact is still 16.71 MB — over the 16MB limit. And eval took 13.5 minutes on L40S (over the 10-minute eval budget). Both problems solve themselves on faster hardware, but we still need INT6 quantization or similar to get the artifact under 16MB.
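Back-of-the-envelope on why a 6-bit quantization should fit, using raw pre-zlib weight sizes and ignoring code bytes (the real artifact compresses further):

```python
params = 18_380_000  # from the results table

int8_bytes = params * 8 // 8   # 18.38 MB raw; zlib brought this hole to 16.71 MB
int6_bytes = params * 6 // 8   # ~13.8 MB raw, under the cap even before zlib

print(int8_bytes, int6_bytes)
```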

The Booth Reacts

Trent: (Removing glasses in genuine surprise) One-point-three-zero-five-five. From one-point-three-three-nine-four. Thirty-four thousandths of improvement and we did not change a single weight. (Long pause) I confess I find this rather extraordinary. The model was already trained. The artifact was already compressed. And yet, simply by reading the examination paper more carefully — overlapping our windows, granting each token its due context — we have found three hundredths of a bit that were there all along. This is the golfing equivalent of discovering you’ve been scoring with the wrong par. The ball was always in the hole. We simply hadn’t looked.

Slice: (Staring at screen) Thirty-four thousandths. FOR FREE. You know what I did in Q-school ‘04 that made the difference? I didn’t change my swing, didn’t buy new clubs, didn’t hire a new instructor. I started reading my putts from BOTH sides of the hole. Same putt, better read, lower score. THAT’S what sliding window is. We had the talent all along — we just weren’t looking at the scoreboard right. (Turns to camera) And by the way, the leaderboard leaders? They’re using stride=64, not 256. We haven’t even maxed this out yet.


The Card

Scorecard
Result: Free Lunch

Picked up strokes on the field

This hole improved the compression score by 0.0349 versus the previous stop. Lower is better here: the model predicts unseen text more efficiently, though the artifact still has no headroom left under the 16 MB cap.

val bpb: 1.3055 (+0.0811 vs baseline). Bits per byte, the headline score: how many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

val loss: 2.2043. Validation cross-entropy loss: the model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits-per-byte; BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.

params: 18,380,000. Total trainable parameters: the number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16MB limit constrains how many parameters you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

artifact: 16.71 MB (over the 16 MB limit). Compressed model + code size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16MB). The model is decompressed and dequantized before evaluation.

wall time: 600s. Training wall clock time: real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.

steps: 2,541. Training steps completed: each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall clock limit.

step avg: 236ms. Average time per training step: how long each gradient update takes in milliseconds. Faster steps mean more training in the same wall clock budget. Affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
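The parameter budgets quoted for the params entry follow directly from the 16,000,000-byte cap:

```python
limit_bits = 16_000_000 * 8

int8_budget = limit_bits // 8            # 16,000,000 params at 8 bits each
ternary_budget = int(limit_bits / 1.58)  # ~81,000,000 params at 1.58 bits each
```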

Training Curve

[Training curve: train_loss vs. step, this hole vs. Baseline (R2); final train_loss 2.3842]
Post-Round Lesson

Massive free gain: 0.034 BPB. Sliding window eval is the single highest-ROI technique we've found. Eval time is too long at stride=256 on L40S (13.5 min) but will be fast enough on H100.

vs. the Field

+0.0858 vs SOTA (1.2197)
+0.0811 vs Baseline (1.2244)
+0.0811 vs Our Best (1.2244)

This Hole: 1.3055 (lower is better)


Model Card

How this hole was run

Run ID round_015_sw256
Status ok
Training Script train_gpt_valemb_sw.py
Backend cuda
Key Overrides
TRAIN_BATCH_TOKENS=131072 · VAL_SW_STRIDE=256 · MAX_WALLCLOCK_SECONDS=600
Back Up The Fairway: Round 1, Hole 14 · Two Tables, Same Story
Head To The Next Tee: Round 1, Hole 16 · Fitting Through the Door