What this score means

Hole 3's compression score: 2.0574 bits per byte.
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
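As a concrete sketch of the metric: cross-entropy losses come out in nats, so bits per byte is total nats divided by ln 2 (nats to bits) and by the raw byte count of the evaluation text. The function name and numbers below are illustrative, not the challenge's actual scoring code.

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats) over an evaluation set
    into bits per byte: nats -> bits, then divide by raw bytes.
    Note: with a tokenizer, loss is summed over tokens but the
    denominator is still the raw byte count of the text."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Toy numbers: 1,000,000 nats of total loss over 700,000 bytes of text.
print(bits_per_byte(1_000_000, 700_000))
```

A model that needed a full 8 bits per byte (i.e. it learned nothing about the text) would score 8.0; the leaderboard numbers around 1.22 mean the model compresses unseen text to roughly 15% of its raw size.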
If PR #42's hotter Muon learning rate helps directionally on the L40S, we can quickly inherit the public best recipe.
Looper’s Pick
So PR #42 on the leaderboard beat the baseline by bumping the Muon learning rate from 0.04 to 0.06 and extending the warmdown. They got 1.2197 on 8xH100. I say we try the same play — it’s proven, it’s simple, and it’s one env var change. Let’s see if the L40S can feel the difference.
The Shot — Tuning the Muon Learning Rate
What is a learning rate, and why does bumping it from 0.04 to 0.06 matter?
Think of the learning rate as how aggressively a golfer swings at the ball. A gentle swing (low LR) is controlled and accurate but doesn’t cover much distance. A powerful swing (high LR) covers ground fast but risks sending the ball into the trees.
In neural network training, the learning rate controls how much the model’s weights change after each batch of training data. The model computes a “gradient” — essentially, a direction that says “adjust these weights this way to reduce the loss” — and the learning rate determines how big a step to take in that direction.
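The update rule above can be sketched in a few lines. This is plain SGD rather than Muon, and the matrices are random stand-ins, but it shows exactly what changing 0.04 to 0.06 does: same direction, a stride 1.5x as long.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4))      # stand-in weight matrix
grad = rng.normal(size=(4, 4))   # stand-in gradient of the loss w.r.t. w

LOW_LR, HIGH_LR = 0.04, 0.06     # the two MATRIX_LR settings

# One plain gradient step at each learning rate.
step_low = w - LOW_LR * grad
step_high = w - HIGH_LR * grad

# The 0.06 step moves the weights 1.5x as far from the starting point.
print(np.linalg.norm(step_high - w) / np.linalg.norm(step_low - w))
```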
Our model uses the Muon optimizer for its main weight matrices (41% of all parameters). Muon is a relatively new optimizer from the NanoGPT speedrun community that works differently from the standard Adam optimizer. Instead of maintaining running averages of gradients, Muon applies a mathematical operation called Newton-Schulz orthogonalization: it normalizes the gradient update so that it is approximately an orthogonal matrix. This spreads the update more evenly across the parameter space, which empirically speeds convergence.
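To make the orthogonalization concrete, here is the classic cubic Newton-Schulz iteration, which drives a matrix toward its orthogonal polar factor without ever computing an SVD. Production Muon uses a tuned quintic variant with different coefficients; this cubic version is a simplified sketch of the same idea.

```python
import numpy as np

def newton_schulz_orth(G: np.ndarray, steps: int = 50) -> np.ndarray:
    """Cubic Newton-Schulz iteration toward the orthogonal polar
    factor U V^T of G. (Muon itself uses a tuned quintic variant.)"""
    # Normalize by the Frobenius norm so all singular values are <= 1,
    # which keeps the iteration in its convergence region.
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = rng.normal(size=(5, 5))
O = newton_schulz_orth(G)
# O should now be numerically orthogonal: O @ O.T close to the identity.
print(np.allclose(O @ O.T, np.eye(5), atol=1e-5))
```

Each iteration pushes every singular value of X toward 1 while leaving the singular vectors alone, which is why the resulting update is "evenly distributed": no single direction in parameter space dominates the step.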
The current best submission on the Parameter Golf leaderboard (PR #42 by chonchiog) achieved 1.2197 BPB — beating the baseline’s 1.2244 — with just two changes: bumping MATRIX_LR from 0.04 to 0.06, and extending WARMDOWN_ITERS from 1200 to 3600. The warmdown controls how the learning rate decays at the end of training; a longer warmdown means a more gradual cooldown, giving the model more time to settle into a good minimum.
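The schedule described above (a constant phase followed by a linear warmdown to zero) can be sketched as below. The shape matches the nanoGPT-speedrun-style schedule the post describes; the exact constants and function name here are illustrative.

```python
def lr_at(step: int, max_lr: float = 0.06, total_iters: int = 13_000,
          warmdown_iters: int = 3_600) -> float:
    """Constant learning rate, then a linear 'warmdown' to zero over
    the final warmdown_iters steps. Extending warmdown_iters (PR #42
    went 1200 -> 3600) makes the final cooldown more gradual."""
    warmdown_start = total_iters - warmdown_iters
    if step < warmdown_start:
        return max_lr
    # Fraction of the warmdown still remaining, decaying linearly to 0.
    frac_left = (total_iters - step) / warmdown_iters
    return max_lr * frac_left

print(lr_at(5_000))    # mid-training: full learning rate
print(lr_at(11_200))   # halfway through the warmdown: half the rate
```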
The catch: this was validated with ~13,000 training steps on 8xH100. We’re running on a single L40S where we get about 300 steps in 5 minutes. A higher learning rate needs more steps to recover from early oscillations — we may be testing the right idea on the wrong timescale.
On the Tee
(Whispering) A bold club selection here. The competitor has reached for the driver — a Muon learning rate of zero-point-zero-six. This is the same club that claimed the current leaderboard position. The question is whether three hundred steps is enough fairway to find the green.
Results
| Metric | Value |
|---|---|
| val_bpb | 2.0574 |
| val_loss | 3.4738 |
| params | 17,059,912 |
| artifact | 7.61 MB (under the 16 MB limit) |
| wall time | 300s |
| steps completed | 300 / 20,000 |
Comparison at Step 200
| Run | train_loss @ step 200 | MATRIX_LR |
|---|---|---|
| Hole 2 (baseline) | 2.7427 | 0.04 |
| Hole 3 | 3.5029 | 0.06 |
The higher LR is 0.76 nats worse at the same step count. It’s overshooting.
Training Curve (tail)
| Step | Train Loss | Avg Step Time |
|---|---|---|
| 10 | 7.7773 | 1000.6ms |
| 200 | 3.5029 | 1001.7ms |
| 300 | — (val: 3.3175) | 1001.7ms |
The Booth Reacts
Trent: Well. That was… rather agricultural, wasn’t it. Three-point-five at step two hundred, where the baseline was already at two-point-seven. One suspects the higher learning rate needs rather more fairway to work with than three hundred steps can provide. A promising club, perhaps, but on the wrong course today.
Slice: Boss. BOSS. A 3.50 train loss at step 200?! I’ve seen better numbers from a GUY USING A SAND WEDGE FROM THE PARKING LOT. Look, I’m not saying 0.06 is wrong — when I was qualifying in ‘04 I was ALL about aggressive play — but you gotta have the runway for it. Three hundred steps with a hot LR is like trying to drive the green on a par 5. Respect the geometry, people!
Trent: The caddy appears to have a rather different idea for Hole 4. Something more… measured, I’m told. We shall see.
The Card
Baseline still has the honor
This score sits +0.8330 BPB above the official baseline's 1.2244. Lower is better — it means the model spends fewer bits to model the same text — and there are 8,394,315 bytes left in the bag.
Some leaderboard ideas are step-budget sensitive. A good H100 recipe can be a terrible L40S recipe.
vs. the Field
| Run | val_bpb |
|---|---|
| PR #42 (leaderboard best) | 1.2197 |
| Official baseline | 1.2244 |
| Hole 2 (baseline) | 1.2244 |
| Hole 3 (this run) | 2.0574 |
Model Card
How this hole was run
| run id | status | device |
|---|---|---|
| round_003_lr06_wd3600 | ok | cuda |