Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

Round 1, Hole 5 · Triple Bogey · March 18, 2026

1.4139

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
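
In code, the conversion from raw validation loss to BPB (the formula quoted on the scorecard below) looks like this. A minimal sketch; the token and byte counts are illustrative, back-solved from this hole's numbers rather than read from the logs:

```python
import math

def bits_per_byte(val_loss_nats: float, tokens: int, total_bytes: int) -> float:
    """BPB = (val_loss / ln 2) * (tokens / bytes): convert nats-per-token
    cross-entropy into bits, then spread it over the raw byte count."""
    return (val_loss_nats / math.log(2)) * (tokens / total_bytes)

# This hole: 2.3873 nats/token is ~3.444 bits/token. Reaching the reported
# 1.4139 BPB implies ~2.44 bytes per token (illustrative ratio, not logged).
print(bits_per_byte(2.3873, tokens=410_500, total_bytes=1_000_000))  # ~1.414
```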

vs baseline: +0.1895 · vs last hole: -0.0222

Tee Box: R1 · H5
Artifact: 13.35 MB
Headroom: 2.65 MB (room left under the 16 MB limit)
Tempo: 229 ms/step · 1,309 steps
Looper · The Caddie · Safe Tweaks

Boss, we've been thinking about this wrong. We're not step-limited because the model is slow — we're step-limited because we're stuffing 524K tokens into every step. That's a big, expensive swing that takes a full second. What if we quarter the batch? 131K tokens per step. Each step is noisier, sure, but we get four times as many swings. On a GPU where wall time is everything, I'd rather take 1,300 imperfect swings than 300 perfect ones.

Technical Read

On slower hardware, quartering the batch will buy enough extra optimizer steps to outweigh the noisier gradient.

Trent Fairway · On the Tee

(Whispering) A rather unconventional approach from the caddie here. Instead of the full driver — half a million tokens per swing — the competitor has elected for a... well, one might call it a controlled punch shot. One hundred and thirty-one thousand tokens. The gallery seems skeptical. But the caddie appears to know something the rest of us do not.

The Shot — Batch Size vs Step Count

Why would seeing less data per step lead to a better model?

In golf, there are two schools of thought on the driving range. Some players take 50 careful, deliberate swings with full analysis between each one. Others bang out 200 quick swings, relying on muscle memory and volume to groove the motion. Neither approach is universally better — it depends on the time you have.

In neural network training, each “step” processes a batch of data, computes how wrong the model’s predictions are (the loss), and updates the weights accordingly. A larger batch gives you a more accurate estimate of the true gradient — like taking a careful, well-aimed swing. A smaller batch gives you a noisier estimate — more like a quick hack at the range.
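
Here is what one such step looks like in a PyTorch-style sketch. It is generic: the function and variable names are illustrative, not the challenge repo's actual code.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, tokens, targets):
    """One update: forward pass, loss, backward pass, weight step.
    The size of `tokens` is the batch size -- bigger means a cleaner
    gradient estimate, smaller means a faster, noisier swing."""
    optimizer.zero_grad(set_to_none=True)
    logits = model(tokens)                              # predictions
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           targets.view(-1))            # how wrong were we?
    loss.backward()                                     # gradients
    optimizer.step()                                    # nudge the weights
    return loss.item()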

The key insight: on our L40S GPU, wall time is the constraint, not data quality. The baseline processes 524,288 tokens per step, taking about 1 second per step. That gives us ~300 steps in 5 minutes. By quartering the batch to 131,072 tokens, each step takes only ~229 milliseconds — about 4.4x faster. In the same 5 minutes, we get 1,309 steps instead of 300.
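
The back-of-envelope version, using the per-step times from this hole's telemetry:

```python
budget_s = 300                             # 5-minute wall clock
full_step_s, small_step_s = 1.002, 0.229   # measured per-step times

print(int(budget_s / full_step_s))   # ~299 steps at 524,288 tokens/step
print(int(budget_s / small_step_s))  # ~1310 steps at 131,072 tokens/step
print(full_step_s / small_step_s)    # ~4.4x faster per step
```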

The gradient from each small batch is noisier, yes. But the Muon optimizer’s Newton-Schulz orthogonalization actually helps here: it normalizes the gradient update regardless of magnitude, which provides a natural smoothing effect that partially compensates for batch noise.
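
A sketch of that orthogonalization step, for the curious. This follows the commonly published quintic Newton-Schulz iteration used by Muon-style optimizers; the coefficients are the public recipe's, and the run's actual script may differ in details (e.g., bfloat16 casting).

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5,
                                eps: float = 1e-7) -> torch.Tensor:
    """Push g toward the nearest semi-orthogonal matrix: the update's
    direction is kept, but its magnitude is equalized across directions
    -- the smoothing effect described above."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    x = g / (g.norm() + eps)            # scale so singular values <= 1
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * (A @ A)) @ x
    return x.T if transposed else x
```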

The results speak for themselves. At step 200, the small batch has a higher train loss (3.08 vs 2.74) — each individual step is less precise. But by step 1309, the model has converged to 1.4139 BPB, beating the full-batch 10-minute run’s 1.4361 in half the time. The volume of updates overwhelmed the per-step quality disadvantage.

This principle is well-known in the optimization literature. There’s a “critical batch size” below which you’re wasting data (too noisy to learn) and above which you’re wasting compute (diminishing returns per token). The default 524K batch was tuned for 8xH100 throughput. On our single L40S, the optimal batch size is much smaller.
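
One published way to locate that critical point is the gradient noise scale from McCandlish et al.'s large-batch training work: measure the squared gradient norm at two batch sizes and solve for the signal and noise terms. A sketch with made-up measurements, since this run didn't log them:

```python
def gradient_noise_scale(b_small, g2_small, b_big, g2_big):
    """Estimate B_noise = tr(Sigma) / |G|^2 from squared gradient norms
    measured at two batch sizes, via E[|G_B|^2] = |G|^2 + tr(Sigma)/B."""
    true_g2 = (b_big * g2_big - b_small * g2_small) / (b_big - b_small)
    trace_sigma = (g2_small - g2_big) / (1 / b_small - 1 / b_big)
    return trace_sigma / true_g2

# Hypothetical norms at this hole's two batch sizes -> ~1e5 tokens,
# i.e. even the quarter batch would be past the knee of the curve.
print(gradient_noise_scale(131_072, 0.00176, 524_288, 0.00119))
```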

Bonus discovery: Memory usage dropped from 10.2 GB to 2.8 GB. The smaller batch barely touches the L40S’s 48GB VRAM, meaning we have enormous headroom for larger models later.
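
Checking that headroom takes two standard torch.cuda calls. The 2,798 MiB figure comes from this run's logs; the snippet just shows how such a number is read out:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... training loop runs here ...
peak_mib = torch.cuda.max_memory_allocated() / (1024 ** 2)
print(f"peak memory: {peak_mib:,.0f} MiB")  # this hole logged 2,798 MiB
```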

Results

val_bpb: 1.4139
val_loss: 2.3873
params: 17,059,912
artifact: 13.35 MB (yes, < 16 MB)
wall time: 300 s
steps completed: 1,309 / 20,000
step avg: 229 ms
peak memory: 2,798 MiB

Head-to-Head vs Baseline (same 5-minute wall time)

                 Baseline (H2, 10 min)   Small Batch (H5, 5 min)
Batch tokens     524,288                 131,072
Step time        1,002 ms                229 ms
Steps in 5 min   ~300                    1,309
val_bpb          1.4361 (at 10 min)      1.4139 (at 5 min)

Training Curve

Step    Loss               Avg Step
200     3.0822             229.3 ms
400     2.8081             229.2 ms
600     2.6599             229.2 ms
800     2.5163             229.7 ms
1000    2.4391             229.5 ms
1200    2.3691             229.4 ms
1309    — (val: 2.3857)    229.4 ms

The Booth Reacts

Trent: (Leaning forward, genuinely surprised) Well I… I must confess I did not see that coming. One-point-four-one-three-nine. In five minutes. That is better than the ten-minute baseline run. The caddie has done something rather clever here — trading precision for volume. It’s the cricketing equivalent of playing for singles instead of boundaries. Unglamorous, but the scoreboard does not lie.

Slice: (Standing up from chair) DID EVERYBODY SEE THAT?! 1.41 in FIVE MINUTES! The baseline needed TEN minutes to get 1.43! This kid just figured out the cheat code — more swings, less thinking! And look at the memory — 2.8 gigs! We went from using a DUMP TRUCK to a GOLF CART and it’s FASTER! When I was qualifying in ‘04 I would’ve KILLED for this kind of insight. You know what the real play is here? We run this batch size with a LONGER wall time and we’re going to the MOON.

Trent: (Adjusting earpiece) I’m told the caddie is already preparing the next club selection. Something about… a larger vocabulary? We shall see.

The Card

Scorecard
Result: Free Lunch

Picked up strokes on the field

This hole improved by 0.0222 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently while leaving 2,648,080 bytes of artifact headroom.

val bpb: 1.4139 (+0.1895)
Bits per byte, the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

val loss: 2.3873
Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits per byte, related by BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.

params: 17,059,912
Total trainable parameters. The number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16 MB limit constrains how many parameters you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

artifact: 13.35 MB
Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16 MB). The model is decompressed and dequantized before evaluation.

wall time: 300 s
Training wall-clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.

steps: 1,309
Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall-clock limit.

step avg: 229 ms
Average time per training step. Faster steps mean more training in the same wall-clock budget; affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5 ms; our L40S averages ~230-1000 ms depending on batch size.
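
The artifact recipe on the card (INT8 quantization, then zlib) can be sketched as below. A minimal per-tensor symmetric quantizer for illustration only; the submission's real packer also stores the scales and the code bytes, and may differ in scheme.

```python
import zlib
import numpy as np

def pack_weights(tensors: dict) -> bytes:
    """Quantize each float tensor to INT8 with a per-tensor scale,
    then zlib-compress the concatenated bytes (scales omitted here)."""
    blobs = []
    for name, w in tensors.items():
        scale = float(np.abs(w).max()) / 127.0
        if scale == 0.0:
            scale = 1.0                       # avoid dividing by zero
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        blobs.append(q.tobytes())
    return zlib.compress(b"".join(blobs), level=9)

# The packed result (plus code bytes) must stay under 16,000,000 bytes.
```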

Training Curve

[Chart: train_loss vs. step, this hole against the Baseline (R2)]
Post-Round Lesson

This was the first clean win. More swings mattered far more than prettier individual updates.

vs. the Field

+0.1942 vs SOTA (1.2197)
+0.1895 vs Baseline (1.2244)
+0.1895 vs Our Best (1.2244)

Model Card

How this hole was run

Run ID: round_005_smallbatch
Status: ok
Backend: cuda
Key Overrides: TRAIN_BATCH_TOKENS=131072, MAX_WALLCLOCK_SECONDS=300
← Back Up The Fairway: Round 1, Hole 4 · Leaving It Short
Head To The Next Tee: Round 1, Hole 6 · The Quant Gap Fix →