Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

Round 1, Hole 3 · Triple Bogey · March 18, 2026

2.0574

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.8330 · vs last hole: —
Tee Box: R1 · H3
Artifact: 7.61 MB
Headroom: 8.39 MB (room left under the 16 MB limit)
Safe Tweaks


Technical Read

If PR #42's hotter Muon learning rate works directionally on L40S, we can inherit the public best recipe quickly.


Looper’s Pick

So PR #42 on the leaderboard beat the baseline by bumping the Muon learning rate from 0.04 to 0.06 and extending the warmdown. They got 1.2197 on 8xH100. I say we try the same play — it’s proven, it’s simple, and it’s one env var change. Let’s see if the L40S can feel the difference.

The Shot — Tuning the Muon Learning Rate

What is a learning rate, and why does bumping it from 0.04 to 0.06 matter?

Think of the learning rate as how aggressively a golfer swings at the ball. A gentle swing (low LR) is controlled and accurate but doesn’t cover much distance. A powerful swing (high LR) covers ground fast but risks sending the ball into the trees.

In neural network training, the learning rate controls how much the model’s weights change after each batch of training data. The model computes a “gradient” — essentially, a direction that says “adjust these weights this way to reduce the loss” — and the learning rate determines how big a step to take in that direction.
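As a concrete illustration (a bare-bones sketch, not the challenge's actual training loop), a single plain gradient-descent update looks like this:

```python
import numpy as np

def sgd_step(weights, gradient, lr):
    """One vanilla gradient-descent update: step against the gradient,
    scaled by the learning rate."""
    return weights - lr * gradient

w = np.array([1.0, -2.0])
g = np.array([0.5, -0.5])       # direction that increases the loss

print(sgd_step(w, g, lr=0.04))  # small, cautious step
print(sgd_step(w, g, lr=0.06))  # same direction, 50% bigger step
```

Same gradient, same direction; the learning rate only scales how far each swing carries.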

Our model uses the Muon optimizer for its main weight matrices (41% of all parameters). Muon is a relatively new optimizer from the NanoGPT speedrun community that works differently from the standard Adam optimizer. Instead of maintaining running averages of gradients, Muon applies a mathematical operation called Newton-Schulz orthogonalization — it normalizes the gradient update so it’s approximately an orthogonal matrix. This produces updates that are more “evenly distributed” across the parameter space, which empirically converges faster.
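To make "approximately orthogonal" concrete, here is a toy NumPy sketch of the idea. Note the hedge: Muon's real implementation uses a specially tuned quintic polynomial iteration run in low precision; the classical cubic Newton-Schulz iteration below is the simpler textbook version of the same operation.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    """Approximately orthogonalize G by driving all of its singular values
    toward 1. This is the classical cubic Newton-Schulz iteration; Muon
    itself uses a tuned quintic polynomial, but the principle is the same."""
    X = G / np.linalg.norm(G)  # Frobenius-normalize so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X

G = np.diag([2.0, 1.0, 0.5])   # a "gradient" with very uneven singular values
O = newton_schulz_orthogonalize(G)
# O @ O.T is (near) the identity: the update keeps the gradient's directions,
# but every direction now gets an equally sized step.
```

This is what "evenly distributed across the parameter space" means in practice: the lopsided singular values 2.0, 1.0, 0.5 all get flattened to roughly 1.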

The current best submission on the Parameter Golf leaderboard (PR #42 by chonchiog) achieved 1.2197 BPB — beating the baseline’s 1.2244 — with just two changes: bumping MATRIX_LR from 0.04 to 0.06, and extending WARMDOWN_ITERS from 1200 to 3600. The warmdown controls how the learning rate decays at the end of training; a longer warmdown means a more gradual cooldown, giving the model more time to settle into a good minimum.
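The warmdown can be pictured as a simple piecewise schedule. This is a minimal sketch assuming a constant learning rate followed by a linear decay to zero; the exact shape in the speedrun code may differ.

```python
def muon_lr(step, base_lr=0.06, total_iters=20_000, warmdown_iters=3_600):
    """Constant learning rate, then a linear 'warmdown' to zero over the
    final warmdown_iters steps. Shape is an assumption, not the actual code."""
    decay_start = total_iters - warmdown_iters
    if step < decay_start:
        return base_lr
    remaining = total_iters - step
    return base_lr * remaining / warmdown_iters

# A longer warmdown starts decaying earlier and descends more gently:
# halfway through a 3,600-step warmdown the LR is at 50% of base,
# whereas a 1,200-step warmdown would still be at full strength there.
```

The design intuition: the model does its exploratory scrambling at full learning rate, then the long, gradual cooldown lets it settle into a good minimum rather than skidding past it.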

The catch: this was validated with ~13,000 training steps on 8xH100. We’re running on a single L40S where we get about 300 steps in 5 minutes. A higher learning rate needs more steps to recover from early oscillations — we may be testing the right idea on the wrong timescale.

On the Tee

(Whispering) A bold club selection here. The competitor has reached for the driver — a Muon learning rate of zero-point-zero-six. This is the same club that claimed the current leaderboard position. The question is whether three hundred steps is enough fairway to find the green.

Results

Metric            Value
----------------  ------------------------
val_bpb           2.0574
val_loss          3.4738
params            17,059,912
artifact          7.61 MB (yes, < 16 MB)
wall time         300 s
steps completed   300 / 20,000

Comparison at Step 200

Run                 train_loss @ step 200   MATRIX_LR
------------------  ----------------------  ---------
Hole 2 (baseline)   2.7427                  0.04
Hole 3              3.5029                  0.06

The higher LR is 0.76 nats worse at the same step count. It’s overshooting.

Training Curve (tail)

Step   Loss              Avg Step Time
-----  ----------------  -------------
10     7.7773            1000.6 ms
200    3.5029            1001.7 ms
300    — (val: 3.3175)   1001.7 ms

The Booth Reacts

Trent: Well. That was… rather agricultural, wasn’t it. Three-point-five at step two hundred, where the baseline was already at two-point-seven. One suspects the higher learning rate needs rather more fairway to work with than three hundred steps can provide. A promising club, perhaps, but on the wrong course today.

Slice: Boss. BOSS. A 3.50 train loss at step 200?! I’ve seen better numbers from a GUY USING A SAND WEDGE FROM THE PARKING LOT. Look, I’m not saying 0.06 is wrong — when I was qualifying in ‘04 I was ALL about aggressive play — but you gotta have the runway for it. Three hundred steps with a hot LR is like trying to drive the green on a par 5. Respect the geometry, people!

Trent: The caddie appears to have a rather different idea for Hole 4. Something more… measured, I’m told. We shall see.


The Card

Scorecard
Result: Dead End

Baseline still has the honor

This score sits +0.8330 above the official baseline. Lower is better: a lower score means the model spends fewer bits to model the same text. The artifact leaves 8,394,315 bytes of headroom under the 16 MB cap.

val bpb: 2.0574 (+0.8330 vs baseline)
Bits per byte, the headline score: how many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

val loss: 3.4738
Validation cross-entropy loss: the model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits per byte. Related to BPB by: BPB = (val_loss / ln 2) × (tokens / bytes). Lower is better.

params: 17,059,912
Total trainable parameters: the number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16 MB limit constrains how many parameters you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

artifact: 7.61 MB (of the 16 MB limit)
Compressed model + code size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16 MB). The model is decompressed and dequantized before evaluation.

wall time: 300 s
Training wall-clock time: real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits, since we're after directional signal, not final scores.
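The BPB conversion above can be checked in a couple of lines. The token-to-byte ratio isn't reported on this card, so the value used below is back-computed from the reported val_loss and val_bpb and should be treated as an assumption:

```python
import math

def bits_per_byte(val_loss_nats, tokens_per_byte):
    """Convert cross-entropy in nats/token to bits/byte:
    BPB = (val_loss / ln 2) * (tokens / bytes)."""
    return (val_loss_nats / math.log(2)) * tokens_per_byte

# tokens_per_byte ~= 0.4105 is inferred from this hole's numbers, not reported
print(round(bits_per_byte(3.4738, 0.4105), 3))  # → 2.057
```

The ln 2 factor is just the change of base from nats to bits; the tokens-per-byte factor rescales per-token loss to per-byte, which is what makes the metric tokenizer-agnostic.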
Post-Round Lesson

Some leaderboard ideas are step-budget sensitive. A good H100 recipe can be a terrible L40S recipe.

vs. the Field

Run         val bpb   This hole vs
----------  --------  ------------
SOTA        1.2197    +0.8377
Baseline    1.2244    +0.8330
Our Best    1.2244    +0.8330
This Hole   2.0574

Lower is better.


Model Card

How this hole was run

Run ID: round_003_lr06_wd3600
Status: ok
Backend: cuda
Key Overrides:
MATRIX_LR=0.06
WARMDOWN_ITERS=3600
MAX_WALLCLOCK_SECONDS=300
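Looper called this "one env var change." A sketch of how a training script might pick these overrides up — the variable names match this hole's Key Overrides, but the reading pattern is an assumption, not the challenge's actual harness code:

```python
import os

# Hyperparameters with baseline defaults, overridable via environment
# variables. How the real training script reads these is an assumption;
# only the names are taken from this hole's Key Overrides.
MATRIX_LR = float(os.environ.get("MATRIX_LR", "0.04"))
WARMDOWN_ITERS = int(os.environ.get("WARMDOWN_ITERS", "1200"))
MAX_WALLCLOCK_SECONDS = int(os.environ.get("MAX_WALLCLOCK_SECONDS", "600"))

print(MATRIX_LR, WARMDOWN_ITERS, MAX_WALLCLOCK_SECONDS)
```

With nothing set, the baseline values apply; exporting `MATRIX_LR=0.06` and `WARMDOWN_ITERS=3600` before launch reproduces this hole's configuration without touching the script.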