What this score means

Hole 3's compression score: 2.0574 bits per byte.
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
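As a concrete sketch of the metric: cross-entropy losses come out in nats, so bits per byte is total nats divided by ln 2 (nats to bits) and by the raw byte count of the evaluation text. The function name and numbers below are illustrative, not the challenge's actual scoring code.

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over an evaluation set
    into bits per byte: nats -> bits, then divide by raw bytes.
    Note: with a tokenizer, loss is summed over tokens but the
    denominator is still the raw byte count of the text."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Toy numbers: 1,000,000 nats of total loss over 700,000 bytes of text.
print(bits_per_byte(1_000_000, 700_000))
```

A model that needed a full 8 bits per byte (i.e. it learned nothing about the text) would score 8.0; the leaderboard numbers around 1.22 mean the model compresses unseen text to roughly 15% of its raw size.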
If PR #42's hotter Muon learning rate helps directionally on the L40S, we can quickly inherit the public best recipe.
Looper’s Pick
So PR #42 on the leaderboard beat the baseline by bumping the Muon learning rate from 0.04 to 0.06 and extending the warmdown. They got 1.2197 on 8xH100. I say we try the same play — it’s proven, it’s simple, and it’s one env var change. Let’s see if the L40S can feel the difference.
The Shot — Tuning the Muon Learning Rate
What is a learning rate, and why does bumping it from 0.04 to 0.06 matter?
Think of the learning rate as how aggressively a golfer swings at the ball. A gentle swing (low LR) is controlled and accurate but doesn’t cover much distance. A powerful swing (high LR) covers ground fast but risks sending the ball into the trees.
In neural network training, the learning rate controls how much the model’s weights change after each batch of training data. The model computes a “gradient” — essentially, a direction that says “adjust these weights this way to reduce the loss” — and the learning rate determines how big a step to take in that direction.
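The update rule above can be sketched in a few lines. This is plain SGD rather than Muon, and the matrices are random stand-ins, but it shows exactly what changing 0.04 to 0.06 does: same direction, a stride 1.5x as long.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))      # stand-in weight matrix
grad = rng.normal(size=(4, 4))   # stand-in gradient of the loss w.r.t. w

LOW_LR, HIGH_LR = 0.04, 0.06     # the two MATRIX_LR settings

# One plain gradient step at each learning rate.
step_low = w - LOW_LR * grad
step_high = w - HIGH_LR * grad

# The 0.06 step moves the weights 1.5x as far from the starting point.
print(np.linalg.norm(step_high - w) / np.linalg.norm(step_low - w))
```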
Our model uses the Muon optimizer for its main weight matrices (41% of all parameters). Muon is a relatively new optimizer from the NanoGPT speedrun community that works differently from the standard Adam optimizer. Instead of maintaining running averages of gradients, Muon applies a mathematical operation called Newton-Schulz orthogonalization: it normalizes the gradient update so that it is approximately an orthogonal matrix. This spreads the update more evenly across the parameter space, which empirically speeds convergence.
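To make the orthogonalization concrete, here is the classic cubic Newton-Schulz iteration, which drives a matrix toward its orthogonal polar factor without ever computing an SVD. Production Muon uses a tuned quintic variant with different coefficients; this cubic version is a simplified sketch of the same idea.

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 50) -> np.ndarray:
    """Cubic Newton-Schulz iteration toward the orthogonal polar
    factor U V^T of G. (Muon itself uses a tuned quintic variant.)"""
    # Normalize by the Frobenius norm so all singular values are <= 1,
    # which keeps the iteration in its convergence region.
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))
O = newton_schulz_orth(G)
# O should now be numerically orthogonal: O @ O.T close to the identity.
print(np.allclose(O @ O.T, np.eye(5), atol=1e-5))
```

Each iteration pushes every singular value of X toward 1 while leaving the singular vectors alone, which is why the resulting update is "evenly distributed": no single direction in parameter space dominates the step.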
The current best submission on the Parameter Golf leaderboard (PR #42 by chonchiog) achieved 1.2197 BPB — beating the baseline’s 1.2244 — with just two changes: bumping MATRIX_LR from 0.04 to 0.06, and extending WARMDOWN_ITERS from 1200 to 3600. The warmdown controls how the learning rate decays at the end of training; a longer warmdown means a more gradual cooldown, giving the model more time to settle into a good minimum.
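The schedule described above (a constant phase followed by a linear warmdown to zero) can be sketched as below. The shape matches the nanoGPT-speedrun-style schedule the post describes; the exact constants and function name here are illustrative.

```python
def lr_at(step: int, max_lr: float = 0.06, total_iters: int = 13_000,
          warmdown_iters: int = 3_600) -> float:
    """Constant learning rate, then a linear 'warmdown' to zero over
    the final warmdown_iters steps. Extending warmdown_iters (PR #42
    went 1200 -> 3600) makes the final cooldown more gradual."""
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return max_lr
    # Fraction of the warmdown still remaining, decaying linearly to 0.
    frac_left = (total_iters - step) / warmdown_iters
    return max_lr * frac_left

print(lr_at(5_000))    # mid-training: full learning rate
print(lr_at(11_200))   # halfway through the warmdown: half the rate
```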
The catch: this was validated with ~13,000 training steps on 8xH100. We’re running on a single L40S where we get about 300 steps in 5 minutes. A higher learning rate needs more steps to recover from early oscillations — we may be testing the right idea on the wrong timescale.
On the Tee
(Whispering) A bold club selection here. The competitor has reached for the driver — a Muon learning rate of zero-point-zero-six. This is the same club that claimed the current leaderboard position. The question is whether three hundred steps is enough fairway to find the green.
Results
| Metric | Value |
|---|---|
| val_bpb | 2.0574 |
| val_loss | 3.4738 |
| params | 17,059,912 |
| artifact | 7.61 MB (under the 16 MB limit) |
| wall time | 300s |
| steps completed | 300 / 20,000 |
Comparison at Step 200
| Run | train_loss @ step 200 | MATRIX_LR |
|---|---|---|
| Hole 2 (baseline) | 2.7427 | 0.04 |
| Hole 3 | 3.5029 | 0.06 |
The higher LR is 0.76 nats worse at the same step count. It’s overshooting.
Training Curve (tail)
| Step | Train Loss | Avg Step Time |
|---|---|---|
| 10 | 7.7773 | 1000.6ms |
| 200 | 3.5029 | 1001.7ms |
| 300 | — (val: 3.3175) | 1001.7ms |
The Booth Reacts
Trent: Well. That was… rather agricultural, wasn’t it. Three-point-five at step two hundred, where the baseline was already at two-point-seven. One suspects the higher learning rate needs rather more fairway to work with than three hundred steps can provide. A promising club, perhaps, but on the wrong course today.
Slice: Boss. BOSS. A 3.50 train loss at step 200?! I’ve seen better numbers from a GUY USING A SAND WEDGE FROM THE PARKING LOT. Look, I’m not saying 0.06 is wrong — when I was qualifying in ‘04 I was ALL about aggressive play — but you gotta have the runway for it. Three hundred steps with a hot LR is like trying to drive the green on a par 5. Respect the geometry, people!
Trent: The caddy appears to have a rather different idea for Hole 4. Something more… measured, I’m told. We shall see.
The Card
Baseline still has the honor
This score sits +0.8330 BPB above the official baseline's 1.2244. Lower is better — it means the model spends fewer bits to model the same text — and there are 8,394,315 bytes left in the bag.
Some leaderboard ideas are step-budget sensitive. A good H100 recipe can be a terrible L40S recipe.
vs. the Field
| Run | val_bpb |
|---|---|
| PR #42 (leaderboard best) | 1.2197 |
| Official baseline | 1.2244 |
| Hole 2 (baseline) | 1.2244 |
| Hole 3 (this run) | 2.0574 |
Model Card
How this hole was run
| run id | status | device |
|---|---|---|
| round_003_lr06_wd3600 | ok | cuda |