Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

Round 1, Hole 4 · Triple Bogey · March 18, 2026

1.9464

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.7220 · vs last hole: —
Tee Box: R1 · H4
Artifact: 7.49 MB
Headroom: 8.51 MB (room left under the 16 MB cap)

Technical Read

If 0.06 was too hot for 300 steps, 0.02 might be the calmer setting that fits our local regime.


Looper’s Pick

The 0.06 was too hot — way too much club for a 300-step hole. Let’s go the other direction. MATRIX_LR=0.02. A softer swing. If we’re only getting 300 steps, we need every single one to count. No overshooting, no wasted energy. Lay it up and let the optimizer do its thing.

The Shot — Dialing Back the Learning Rate

Why would a *lower* learning rate help when you have fewer training steps?

Imagine you’re putting on a sloped green. You could whack it hard and hope it banks off the far lip and drops in — that’s a high learning rate. Or you could read the break carefully and give it just enough pace to die at the hole — that’s a low learning rate.

When you have thousands of putts (training steps), the aggressive approach works: even if you blast past the hole, you get another try, and another. Eventually the ball finds the cup. But when you only get a few attempts — say, 300 — you can’t afford to overshoot. Each step needs to make measured, reliable progress toward the minimum.

In Hole 3, we saw this play out exactly. The higher learning rate (0.06) had the model’s train loss at 3.50 at step 200, while the baseline (0.04) was already at 2.74. The higher LR was making bigger updates per step, but those updates were overcorrecting — bouncing back and forth across the loss landscape instead of converging smoothly.
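You can see that bouncing mechanism in miniature with a toy sketch. This is plain gradient descent on a one-dimensional bowl, nothing like the real 17M-parameter run, and the rates below are chosen for the toy, not taken from the hole:

```python
# Toy bowl: f(w) = w^2, so the gradient is 2w and each descent step
# maps w -> (1 - 2*lr) * w. Moderate rates glide toward the minimum,
# rates past 0.5 flip sign every step (the "bouncing"), and rates
# past 1.0 bounce outward and diverge.
def descend(lr, steps=8, w=1.0):
    path = [w]
    for _ in range(steps):
        w -= lr * 2 * w          # one gradient descent update
        path.append(w)
    return path

for lr in (0.3, 0.9, 1.05):      # smooth / oscillating / divergent
    print(f"lr={lr}: " + "  ".join(f"{w:+.3f}" for w in descend(lr)))
```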

A learning rate of 0.02 makes each gradient update half the size of the baseline. The model moves more cautiously through the loss landscape. The trade-off: it may converge more slowly in absolute terms, meaning you need more steps to reach the same final loss. But in a step-limited regime, the stability can more than compensate.

There's a genuine trade-off in stochastic optimization between learning rate and step count: with noisy gradients, a hotter rate covers ground faster early but stalls at a higher noise floor, while a cooler rate descends more slowly toward a lower one. Which side wins depends on the step budget. The NanoGPT speedrun community tunes this balance empirically for transformer training; the relationship isn't simple, especially with momentum-based optimizers like Muon.
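Here is that trade-off in a hedged toy experiment: the same w² bowl, now with Gaussian noise added to the gradient as a crude stand-in for minibatch noise. None of these numbers transfer to the real run; the point is only that the winning rate flips when the step budget changes.

```python
import random

def final_loss(lr, steps, trials=50):
    """Average final f(w) = w^2 after `steps` noisy gradient steps."""
    total = 0.0
    for t in range(trials):
        rng, w = random.Random(t), 10.0
        for _ in range(steps):
            w -= lr * (2 * w + rng.gauss(0.0, 1.0))   # noisy gradient
        total += w * w
    return total / trials

# A short budget favors the hot rate (it covers ground fast); a long
# budget favors the cool rate (it settles at a lower noise floor).
for budget in (30, 3000):
    scores = {lr: final_loss(lr, budget) for lr in (0.02, 0.04, 0.2)}
    best = min(scores, key=scores.get)
    print(f"{budget:>5} steps -> best lr {best}:",
          {lr: round(v, 3) for lr, v in scores.items()})
```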

On the Tee

(Whispering) After the rather… exuberant display of Hole 3, the competitor has reached for a gentler club. A learning rate of zero-point-zero-two. Half the force of the baseline. One senses a lesson has been learned about restraint. The question now is whether caution alone can find the fairway.

Results

Metric            Value
val_bpb           1.9464
val_loss          3.2865
params            17,059,912
artifact          7.49 MB (yes, < 16 MB)
wall time         300s
steps completed   300 / 20,000

Learning Rate Bracket (train loss at step 200, val bpb at step 300)

MATRIX_LR          train_loss @ 200    val_bpb @ 300
0.04 (baseline)    2.7427              1.4292*
0.02 (this hole)   3.2815              1.8371
0.06 (Hole 3)      3.5029              1.9648

*Baseline ran for 600s / 599 steps, not 300s / 300 steps.

The Booth Reacts

Trent: Well now. Three-point-two-eight at step two hundred. An improvement over the zero-point-zero-six debacle, certainly, but still trailing the baseline’s two-point-seven-four by a comfortable margin. The lower learning rate has brought stability, one observes — note how much more composed the early steps are — but the default appears to have found the rather better balance for this step count.

Slice: OK so we went conservative and it’s STILL worse than the factory settings? Boss, I gotta be honest with you — and you know I’m ALWAYS honest — the default 0.04 is looking like the right club here. It’s not sexy. It’s not what the leaderboard guys are using. But they’re playing an 8xH100 course and we’re on a municipal L40S. Different course, different strategy. When I was qualifying in ‘04, you know what won? NOT trying to be clever.

Trent: Quite. One suspects the caddie may be arriving at a similar conclusion. The learning rate, it would appear, is not where the strokes are to be found on this particular hardware.


The Card

Result: Dead End

Baseline still has the honor

This score sits +0.7220 versus the official baseline. Lower is better: the model spends fewer bits to model the same text. The artifact leaves 8,514,666 bytes in the bag under the 16 MB cap.

val bpb · 1.9464 (+0.7220)
Bits per byte — the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

val loss · 3.2865
Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits per byte, related by BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.

params · 17,059,912
Total trainable parameters. The number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16 MB limit constrains how many parameters you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

artifact · 7.49 MB
Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16 MB). The model is decompressed and dequantized before evaluation.

wall time · 300s
Training wall-clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.
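Two quick sanity checks on the card, using only numbers quoted above; the tokens-per-byte ratio is derived here, not something the scorecard reports.

```python
import math

# Invert the quoted relation BPB = (val_loss / ln(2)) × (tokens / bytes)
# to back out the implied tokens-per-byte ratio for this run.
val_loss, val_bpb = 3.2865, 1.9464
tokens_per_byte = val_bpb * math.log(2) / val_loss
print(f"tokens/bytes ~ {tokens_per_byte:.3f}")   # ~0.411, i.e. ~2.4 bytes per token

# Headroom check: the 16,000,000-byte cap minus the 8,514,666 bytes
# "left in the bag" should reproduce the artifact size.
print(f"artifact ~ {(16_000_000 - 8_514_666) / 1e6:.2f} MB")   # ~7.49 MB, matches
```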
Post-Round Lesson

The stock 0.04 is a better compromise here. Learning rate is not the first lever to pull on this hardware.

vs. the Field

+0.7267 vs SOTA (1.2197)
+0.7220 vs Baseline (1.2244)
+0.7220 vs Our Best (1.2244)


Model Card

How this hole was run

Run ID: round_004_lr02
Status: ok
Backend: cuda
Key Overrides
MATRIX_LR=0.02
MAX_WALLCLOCK_SECONDS=300
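How these overrides reach the training script isn't shown anywhere on the card, so treat this as a hypothetical sketch of one common pattern: environment variables with the baseline values baked in as defaults (0.04 from the bracket table, 600 seconds from the baseline footnote).

```python
import os

# HYPOTHETICAL sketch -- the actual script's config mechanism is unknown.
# Defaults mirror the baseline run quoted earlier (LR 0.04, 600s).
MATRIX_LR = float(os.environ.get("MATRIX_LR", "0.04"))
MAX_WALLCLOCK_SECONDS = int(os.environ.get("MAX_WALLCLOCK_SECONDS", "600"))

print(MATRIX_LR, MAX_WALLCLOCK_SECONDS)  # 0.02 300 with this hole's overrides set
```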
← Back Up The Fairway: Round 1, Hole 3, "Grip It and Rip It"
Head To The Next Tee: Round 1, Hole 5, "More Swings, Less Club" →