Compression score: 1.9464 bits per byte
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
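For concreteness, here's a minimal sketch of that conversion, assuming the trainer reports mean cross-entropy in nats per token and the validation set's token and byte counts are known. The function and the byte-per-token ratio below are illustrative assumptions, not values from the run logs:

```python
import math

def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) into bits per byte.

    Nats become bits via division by ln(2); the per-token score is then
    rescaled to per-byte using the tokens-per-byte ratio of the eval set.
    """
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * (n_tokens / n_bytes)

# Rough sanity check against this hole's card: a val_loss of 3.2865
# nats/token and a tokenizer averaging ~2.44 bytes per token lands
# near the reported 1.9464 bpb. (The ~2.44 is inferred from the two
# reported numbers, not measured.)
print(bits_per_byte(3.2865, n_tokens=1_000, n_bytes=2_436))
```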
If 0.06 was too hot for 300 steps, 0.02 might be the calmer setting that fits our local regime.
Looper’s Pick
The 0.06 was too hot — way too much club for a 300-step hole. Let’s go the other direction. MATRIX_LR=0.02. A softer swing. If we’re only getting 300 steps, we need every single one to count. No overshooting, no wasted energy. Lay it up and let the optimizer do its thing.
The Shot — Dialing Back the Learning Rate
Why would a *lower* learning rate help when you have fewer training steps?
Imagine you’re putting on a sloped green. You could whack it hard and hope it banks off the far lip and drops in — that’s a high learning rate. Or you could read the break carefully and give it just enough pace to die at the hole — that’s a low learning rate.
When you have thousands of putts (training steps), the aggressive approach works: even if you blast past the hole, you get another try, and another. Eventually the ball finds the cup. But when you only get a few attempts — say, 300 — you can’t afford to overshoot. Each step needs to make measured, reliable progress toward the minimum.
In Hole 3, we saw this play out exactly. The higher learning rate (0.06) had the model’s train loss at 3.50 at step 200, while the baseline (0.04) was already at 2.74. The higher LR was making bigger updates per step, but those updates were overcorrecting — bouncing back and forth across the loss landscape instead of converging smoothly.
A learning rate of 0.02 makes each gradient update half the size of the baseline. The model moves more cautiously through the loss landscape. The trade-off: it may converge more slowly in absolute terms, meaning you need more steps to reach the same final loss. But in a step-limited regime, the stability can more than compensate.
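To watch the overshoot mechanics in isolation, here's a toy sketch (plain gradient descent on a 1-D quadratic, not the actual Muon update), with the curvature chosen purely for illustration so that the 0.02/0.04/0.06 bracket straddles the stability threshold:

```python
# Toy model: f(x) = 0.5 * a * x**2, gradient a*x, update x <- x * (1 - lr*a).
# Plain GD on a quadratic is stable iff lr < 2/a. Curvature a = 40 is an
# arbitrary choice that puts our bracket on both sides of that limit.
a = 40.0

for lr in (0.02, 0.04, 0.06):
    x = 1.0                      # start one unit from the minimum at 0
    for _ in range(10):
        grad = a * x
        x -= lr * grad
    print(f"lr={lr:.2f}: x after 10 steps = {x:+.4f}")

# lr=0.02 -> +0.0000   smooth, monotone decay toward the minimum
# lr=0.04 -> +0.0060   converges, but crosses the minimum every step
# lr=0.06 -> +28.9255  past the stability limit: oscillates and blows up
```

The same shape is consistent with the Hole 3 curves: the 0.06 run wasn't failing to learn, it was crossing the valley floor on every step.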
There's a classic trade-off in optimization between learning rate and step count: for a fixed compute budget, the learning rate that minimizes final loss depends on how many steps you get. The NanoGPT speedrun community has found this holds empirically for transformer training, though the relationship isn't always simple, especially with momentum-based optimizers like Muon.
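Bracketing that relationship empirically is just a loop over candidate learning rates. A minimal sweep sketch, assuming the trainer reads MATRIX_LR from the environment; the script name train.py and the invocation are hypothetical, and the real speedrun harness likely has its own launcher:

```python
import os
import subprocess

# Hypothetical bracket: one short, fixed-budget run per candidate LR.
for lr in ("0.02", "0.04", "0.06"):
    env = dict(os.environ, MATRIX_LR=lr)
    print(f"--- MATRIX_LR={lr} ---", flush=True)
    subprocess.run(["python", "train.py"], env=env, check=True)
```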
On the Tee
(Whispering) After the rather… exuberant display of Hole 3, the competitor has reached for a gentler club. A learning rate of zero-point-zero-two. Half the force of the baseline. One senses a lesson has been learned about restraint. The question now is whether caution alone can find the fairway.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.9464 |
| val_loss | 3.2865 |
| params | 17,059,912 |
| artifact | 7.49 MB (under the 16 MB limit) |
| wall time | 300s |
| steps completed | 300 / 20,000 |
Learning Rate Bracket
| MATRIX_LR | train_loss @ 200 | val_bpb @ 300 |
|---|---|---|
| 0.04 (baseline) | 2.7427 | 1.4292* |
| 0.02 (this hole) | 3.2815 | 1.9464 |
| 0.06 (Hole 3) | 3.5029 | 1.9648 |
*Baseline ran for 600s / 599 steps, not 300s / 300 steps.
The Booth Reacts
Trent: Well now. Three-point-two-eight at step two hundred. An improvement over the zero-point-zero-six debacle, certainly, but still trailing the baseline’s two-point-seven-four by a comfortable margin. The lower learning rate has brought stability, one observes — note how much more composed the early steps are — but the default appears to have found the rather better balance for this step count.
Slice: OK so we went conservative and it’s STILL worse than the factory settings? Boss, I gotta be honest with you — and you know I’m ALWAYS honest — the default 0.04 is looking like the right club here. It’s not sexy. It’s not what the leaderboard guys are using. But they’re playing an 8xH100 course and we’re on a municipal L40S. Different course, different strategy. When I was qualifying in ‘04, you know what won? NOT trying to be clever.
Trent: Quite. One suspects the caddy may be arriving at a similar conclusion. The learning rate, it would appear, is not where the strokes are to be found on this particular hardware.
The Card
Baseline still has the honor
This score sits +0.7220 bpb above the official baseline (1.2244). Lower is better: fewer bits spent modeling the same 8,514,666 bytes of held-out text.
The stock 0.04 is a better compromise here. Learning rate is not the first lever to pull on this hardware.
vs. the Field
1.2197 · 1.2244 · 1.2244 · 1.9464 (this hole)
Model Card
How this hole was run
Run: round_004_lr02 · status: ok · device: cuda