Compression score: 1.9464 bits per byte
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
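For concreteness, here's a minimal sketch of that conversion, assuming the trainer reports mean cross-entropy in nats per token and the validation set's token and byte counts are known. The function and the byte-per-token ratio below are illustrative assumptions, not values from the run logs:

```python
import math

def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte.

    Nats become bits via division by ln(2); the per-token score is then
    rescaled to per-byte using the tokens-per-byte ratio of the eval set.
    """
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * (n_tokens / n_bytes)

# Rough sanity check against this hole's card: a val_loss of 3.2865
# nats/token and a tokenizer averaging ~2.44 bytes per token lands
# near the reported 1.9464 bpb. (The ~2.44 is inferred from the two
# reported numbers, not measured.)
print(bits_per_byte(3.2865, n_tokens=1_000, n_bytes=2_436))
```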
If 0.06 was too hot for 300 steps, 0.02 might be the calmer setting that fits our local regime.
Looper’s Pick
The 0.06 was too hot — way too much club for a 300-step hole. Let’s go the other direction. MATRIX_LR=0.02. A softer swing. If we’re only getting 300 steps, we need every single one to count. No overshooting, no wasted energy. Lay it up and let the optimizer do its thing.
The Shot — Dialing Back the Learning Rate
Why would a *lower* learning rate help when you have fewer training steps?
Imagine you’re putting on a sloped green. You could whack it hard and hope it banks off the far lip and drops in — that’s a high learning rate. Or you could read the break carefully and give it just enough pace to die at the hole — that’s a low learning rate.
When you have thousands of putts (training steps), the aggressive approach works: even if you blast past the hole, you get another try, and another. Eventually the ball finds the cup. But when you only get a few attempts — say, 300 — you can’t afford to overshoot. Each step needs to make measured, reliable progress toward the minimum.
In Hole 3, we saw this play out exactly. The higher learning rate (0.06) had the model’s train loss at 3.50 at step 200, while the baseline (0.04) was already at 2.74. The higher LR was making bigger updates per step, but those updates were overcorrecting — bouncing back and forth across the loss landscape instead of converging smoothly.
A learning rate of 0.02 makes each gradient update half the size of the baseline. The model moves more cautiously through the loss landscape. The trade-off: it may converge more slowly in absolute terms, meaning you need more steps to reach the same final loss. But in a step-limited regime, the stability can more than compensate.
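To watch the overshoot mechanics in isolation, here's a toy sketch (plain gradient descent on a 1-D quadratic, not the actual Muon update), with the curvature chosen purely for illustration so that the 0.02/0.04/0.06 bracket straddles the stability threshold:

```python
# Toy model: f(x) = 0.5 * a * x**2, gradient a*x, update x <- x * (1 - lr*a).
# Plain GD on a quadratic is stable iff lr < 2/a. Curvature a = 40 is an
# arbitrary choice that puts our bracket on both sides of that limit.
a = 40.0

for lr in (0.02, 0.04, 0.06):
    x = 1.0                      # start one unit from the minimum at 0
    for _ in range(10):
        grad = a * x
        x -= lr * grad
    print(f"lr={lr:.2f}: x after 10 steps = {x:+.4f}")

# lr=0.02 -> +0.0000   smooth, monotone decay toward the minimum
# lr=0.04 -> +0.0060   converges, but crosses the minimum every step
# lr=0.06 -> +28.9255  past the stability limit: oscillates and blows up
```

The same shape is consistent with the Hole 3 curves: the 0.06 run wasn't failing to learn, it was crossing the valley floor on every step.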
There's a classic trade-off in optimization between learning rate and step count: for a fixed compute budget, the learning rate that minimizes final loss depends on how many steps you get. The NanoGPT speedrun community has found this holds empirically for transformer training, though the relationship isn't always simple, especially with momentum-based optimizers like Muon.
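Bracketing that relationship empirically is just a loop over candidate learning rates. A minimal sweep sketch, assuming the trainer reads MATRIX_LR from the environment; the script name train.py and the invocation are hypothetical, and the real speedrun harness likely has its own launcher:

```python
import os
import subprocess

# Hypothetical bracket: one short, fixed-budget run per candidate LR.
for lr in ("0.02", "0.04", "0.06"):
    env = dict(os.environ, MATRIX_LR=lr)
    print(f"--- MATRIX_LR={lr} ---", flush=True)
    subprocess.run(["python", "train.py"], env=env, check=True)
```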
On the Tee
(Whispering) After the rather… exuberant display of Hole 3, the competitor has reached for a gentler club. A learning rate of zero-point-zero-two. Half the force of the baseline. One senses a lesson has been learned about restraint. The question now is whether caution alone can find the fairway.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.9464 |
| val_loss | 3.2865 |
| params | 17,059,912 |
| artifact | 7.49 MB (under the 16 MB limit) |
| wall time | 300s |
| steps completed | 300 / 20,000 |
Learning Rate Bracket
| MATRIX_LR | train_loss @ 200 | val_bpb @ 300 |
|---|---|---|
| 0.04 (baseline) | 2.7427 | 1.4292* |
| 0.02 (this hole) | 3.2815 | 1.9464 |
| 0.06 (Hole 3) | 3.5029 | 1.9648 |
*Baseline ran for 600s / 599 steps, not 300s / 300 steps.
The Booth Reacts
Trent: Well now. Three-point-two-eight at step two hundred. An improvement over the zero-point-zero-six debacle, certainly, but still trailing the baseline’s two-point-seven-four by a comfortable margin. The lower learning rate has brought stability, one observes — note how much more composed the early steps are — but the default appears to have found the rather better balance for this step count.
Slice: OK so we went conservative and it’s STILL worse than the factory settings? Boss, I gotta be honest with you — and you know I’m ALWAYS honest — the default 0.04 is looking like the right club here. It’s not sexy. It’s not what the leaderboard guys are using. But they’re playing an 8xH100 course and we’re on a municipal L40S. Different course, different strategy. When I was qualifying in ‘04, you know what won? NOT trying to be clever.
Trent: Quite. One suspects the caddy may be arriving at a similar conclusion. The learning rate, it would appear, is not where the strokes are to be found on this particular hardware.
The Card
Baseline still has the honor
This score sits +0.7220 bpb above the official baseline (1.2244). Lower is better: fewer bits spent modeling the same 8,514,666 bytes of held-out text.
The stock 0.04 is a better compromise here. Learning rate is not the first lever to pull on this hardware.
vs. the Field
1.2197 · 1.2244 · 1.2244 · 1.9464 (this hole)
Model Card
How this hole was run
Run: round_004_lr02 · status: ok · device: cuda