Compression score: 1.4139

What this score means

Quick read before we head down the fairway. Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
On slower hardware, quartering the batch will buy enough extra optimizer steps to outweigh the noisier gradient.
Looper’s Pick
Boss, we’ve been thinking about this wrong. We’re not step-limited because the model is slow — we’re step-limited because we’re stuffing 524K tokens into every step. That’s a big, expensive swing that takes a full second. What if we quarter the batch? 131K tokens per step. Each step is noisier, sure, but we get four times as many swings. On a GPU where wall time is everything, I’d rather take 1300 imperfect swings than 300 perfect ones.
The Shot — Batch Size vs Step Count
Why would seeing less data per step lead to a better model?
In golf, there are two schools of thought on the driving range. Some players take 50 careful, deliberate swings with full analysis between each one. Others bang out 200 quick swings, relying on muscle memory and volume to groove the motion. Neither approach is universally better — it depends on the time you have.
In neural network training, each “step” processes a batch of data, computes how wrong the model’s predictions are (the loss), and updates the weights accordingly. A larger batch gives you a more accurate estimate of the true gradient — like taking a careful, well-aimed swing. A smaller batch gives you a noisier estimate — more like a quick hack at the range.
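The batch-vs-noise trade can be seen in a toy sketch (pure Python, all names illustrative: each "example" contributes the true gradient plus noise, and the batch estimate is their average):

```python
import random

random.seed(0)

TRUE_GRAD = 1.0  # the gradient we are trying to estimate

def batch_gradient_error(batch_size, n_trials=500):
    """Mean absolute error of a batch-averaged gradient estimate."""
    total = 0.0
    for _ in range(n_trials):
        # Each example's gradient is the true value plus unit Gaussian noise.
        est = sum(TRUE_GRAD + random.gauss(0, 1) for _ in range(batch_size))
        total += abs(est / batch_size - TRUE_GRAD)
    return total / n_trials

# Error shrinks roughly as 1/sqrt(batch): a 4x smaller batch is only ~2x noisier.
print(batch_gradient_error(131) > batch_gradient_error(524))  # True
```

Quartering the tokens per step roughly doubles the gradient noise while quadrupling the number of steps; the bet is that the second effect wins on wall time.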
The key insight: on our L40S GPU, wall time is the constraint, not data quality. The baseline processes 524,288 tokens per step, taking about 1 second per step. That gives us ~300 steps in 5 minutes. By quartering the batch to 131,072 tokens, each step takes only ~229 milliseconds — about 4.4x faster. In the same 5 minutes, we get 1,309 steps instead of 300.
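The step-count arithmetic is easy to verify. The step times below are the measured averages from the two runs; the helper function is illustrative:

```python
BUDGET_MS = 5 * 60 * 1000  # 5-minute wall-time budget

# Measured average step times from the two configurations.
FULL_BATCH_STEP_MS = 1002   # 524,288 tokens per step
SMALL_BATCH_STEP_MS = 229   # 131,072 tokens per step

def steps_in_budget(step_ms, budget_ms=BUDGET_MS):
    """How many optimizer steps fit in the wall-time budget."""
    return budget_ms // step_ms

print(steps_in_budget(FULL_BATCH_STEP_MS))   # 299, i.e. ~300 steps
print(steps_in_budget(SMALL_BATCH_STEP_MS))  # 1310, matching the logged 1,309
```

Note that the 4.4x per-step speedup slightly beats the 4x batch reduction, so the small-batch run also processes about 10% more total tokens (1310 x 131,072 is roughly 172M vs 299 x 524,288 at roughly 157M).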
The gradient from each small batch is noisier, yes. But the Muon optimizer’s Newton-Schulz orthogonalization actually helps here: it normalizes the gradient update regardless of magnitude, which provides a natural smoothing effect that partially compensates for batch noise.
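A sketch of the Newton-Schulz orthogonalization at the heart of Muon (the quintic coefficients follow the public Muon implementation; this is an illustrative NumPy version, not the actual training code):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately map g to the nearest orthogonal matrix via a quintic
    Newton-Schulz iteration, so the weight update has unit-scale singular
    values no matter how large or small the raw gradient was."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon optimizer
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x

rng = np.random.default_rng(0)
g = rng.normal(size=(8, 8))

# The update is invariant to gradient magnitude: scaling g by 100x changes
# nothing, which is the smoothing effect described above.
print(np.allclose(newton_schulz_orthogonalize(g),
                  newton_schulz_orthogonalize(100 * g)))  # True
```

Because every weight matrix gets an update with singular values near 1, a noisy small-batch gradient can change the update's direction but not blow up its scale.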
The results speak for themselves. At step 200, the small batch has a higher train loss (3.08 vs 2.74) — each individual step is less precise. But by step 1309, the model has converged to 1.4139 BPB, beating the full-batch 10-minute run’s 1.4361 in half the time. The volume of updates overwhelmed the per-step quality disadvantage.
This principle is well known in the optimization literature. There’s a “critical batch size”: below it, the extra noise per step costs you wall time to average out; above it, extra tokens per step buy little (diminishing returns per token). The default 524K batch was tuned for 8xH100 throughput. On our single L40S, the optimal batch size is much smaller.
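One back-of-envelope estimator from that literature is the “simple” gradient noise scale of McCandlish et al.: the ratio of per-example gradient variance to the squared mean gradient. Batches far below it are noise-dominated; batches far above it waste compute. A toy 1-D sketch (the numbers are illustrative, not measured from this run):

```python
import random

random.seed(1)

# Per-example gradients for a toy 1-D loss: a small true signal buried in noise.
grads = [0.5 + random.gauss(0, 2.0) for _ in range(10_000)]

mean_g = sum(grads) / len(grads)
var_g = sum((g - mean_g) ** 2 for g in grads) / len(grads)

# Simple noise scale: roughly how many examples a batch needs before the
# averaged signal rivals the noise. Here var/mean^2 is about 4 / 0.25 = 16.
noise_scale = var_g / mean_g ** 2
print(10 < noise_scale < 30)  # True
```

Measuring this for the real model would tell us how much further the batch could shrink before the extra steps stop paying for themselves.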
Bonus discovery: Memory usage dropped from 10.2 GB to 2.8 GB. The smaller batch barely touches the L40S’s 48GB VRAM, meaning we have enormous headroom for larger models later.
On the Tee
(Whispering) A rather unconventional approach from the caddy here. Instead of the full driver — half a million tokens per swing — the competitor has elected for a… well, one might call it a controlled punch shot. One hundred and thirty-one thousand tokens. The gallery seems skeptical. But the caddy appears to know something the rest of us do not.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4139 |
| val_loss | 2.3873 |
| params | 17,059,912 |
| artifact | 13.35 MB (yes < 16MB) |
| wall time | 300s |
| steps completed | 1,309 / 20,000 |
| step avg | 229ms |
| peak memory | 2,798 MiB |
Head-to-Head vs Baseline (same 5-minute wall time)
| | Baseline (H2, 10min) | Small Batch (H5, 5min) |
|---|---|---|
| Batch tokens | 524,288 | 131,072 |
| Step time | 1,002ms | 229ms |
| Steps in 5min | ~300 | 1,309 |
| val_bpb | 1.4361 (at 10min) | 1.4139 (at 5min) |
Training Curve
| Step | Loss | Avg Step |
|---|---|---|
| 200 | 3.0822 | 229.3ms |
| 400 | 2.8081 | 229.2ms |
| 600 | 2.6599 | 229.2ms |
| 800 | 2.5163 | 229.7ms |
| 1000 | 2.4391 | 229.5ms |
| 1200 | 2.3691 | 229.4ms |
| 1309 | — (val: 2.3857) | 229.4ms |
The Booth Reacts
Trent: (Leaning forward, genuinely surprised) Well I… I must confess I did not see that coming. One-point-four-one-three-nine. In five minutes. That is better than the ten-minute baseline run. The caddy has done something rather clever here — trading precision for volume. It’s the cricketing equivalent of playing for singles instead of boundaries. Unglamorous, but the scoreboard does not lie.
Slice: (Standing up from chair) DID EVERYBODY SEE THAT?! 1.41 in FIVE MINUTES! The baseline needed TEN minutes to get 1.43! This kid just figured out the cheat code — more swings, less thinking! And look at the memory — 2.8 gigs! We went from using a DUMP TRUCK to a GOLF CART and it’s FASTER! When I was qualifying in ‘04 I would’ve KILLED for this kind of insight. You know what the real play is here? We run this batch size with a LONGER wall time and we’re going to the MOON.
Trent: (Adjusting earpiece) I’m told the caddy is already preparing the next club selection. Something about… a larger vocabulary? We shall see.
The Card
Picked up strokes on the field
This hole improved 0.0222 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently while leaving 2,648,080 bytes of artifact headroom.
Training Curve
This was the first clean win. More swings mattered far more than prettier individual updates.
vs. the Field
Leaderboard chart: field scores of 1.2197, 1.2244, and 1.2244 against this hole's 1.4139.
Model Card

How this hole was run: round_005_smallbatch (status: ok, device: cuda)