Compression score: 1.4139

What this score means

Quick read before we head down the fairway. Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
On slower hardware, quartering the batch will buy enough extra optimizer steps to outweigh the noisier gradient.
Looper’s Pick
Boss, we’ve been thinking about this wrong. We’re not step-limited because the model is slow — we’re step-limited because we’re stuffing 524K tokens into every step. That’s a big, expensive swing that takes a full second. What if we quarter the batch? 131K tokens per step. Each step is noisier, sure, but we get four times as many swings. On a GPU where wall time is everything, I’d rather take 1300 imperfect swings than 300 perfect ones.
The Shot — Batch Size vs Step Count
Why would seeing less data per step lead to a better model?
In golf, there are two schools of thought on the driving range. Some players take 50 careful, deliberate swings with full analysis between each one. Others bang out 200 quick swings, relying on muscle memory and volume to groove the motion. Neither approach is universally better — it depends on the time you have.
In neural network training, each “step” processes a batch of data, computes how wrong the model’s predictions are (the loss), and updates the weights accordingly. A larger batch gives you a more accurate estimate of the true gradient — like taking a careful, well-aimed swing. A smaller batch gives you a noisier estimate — more like a quick hack at the range.
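The batch-vs-noise trade can be seen in a toy sketch (pure Python, all names illustrative: each "example" contributes the true gradient plus noise, and the batch estimate is their average):

```python
import random

random.seed(0)

TRUE_GRAD = 1.0  # the gradient we are trying to estimate

def batch_gradient_error(batch_size, n_trials=500):
    """Mean absolute error of a batch-averaged gradient estimate."""
    total = 0.0
    for _ in range(n_trials):
        # Each example's gradient is the true value plus unit Gaussian noise.
        est = sum(TRUE_GRAD + random.gauss(0, 1) for _ in range(batch_size))
        total += abs(est / batch_size - TRUE_GRAD)
    return total / n_trials

# Error shrinks roughly as 1/sqrt(batch): a 4x smaller batch is only ~2x noisier.
print(batch_gradient_error(131) > batch_gradient_error(524))  # True
```

Quartering the tokens per step roughly doubles the gradient noise while quadrupling the number of steps; the bet is that the second effect wins on wall time.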
The key insight: on our L40S GPU, wall time is the constraint, not data quality. The baseline processes 524,288 tokens per step, taking about 1 second per step. That gives us ~300 steps in 5 minutes. By quartering the batch to 131,072 tokens, each step takes only ~229 milliseconds — about 4.4x faster. In the same 5 minutes, we get 1,309 steps instead of 300.
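The step-count arithmetic is easy to verify. The step times below are the measured averages from the two runs; the helper function is illustrative:

```python
BUDGET_MS = 5 * 60 * 1000  # 5-minute wall-time budget

# Measured average step times from the two configurations.
FULL_BATCH_STEP_MS = 1002   # 524,288 tokens per step
SMALL_BATCH_STEP_MS = 229   # 131,072 tokens per step

def steps_in_budget(step_ms, budget_ms=BUDGET_MS):
    """How many optimizer steps fit in the wall-time budget."""
    return budget_ms // step_ms

print(steps_in_budget(FULL_BATCH_STEP_MS))   # 299, i.e. ~300 steps
print(steps_in_budget(SMALL_BATCH_STEP_MS))  # 1310, matching the logged 1,309
```

Note that the 4.4x per-step speedup slightly beats the 4x batch reduction, so the small-batch run also processes about 10% more total tokens (1310 x 131,072 is roughly 172M vs 299 x 524,288 at roughly 157M).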
The gradient from each small batch is noisier, yes. But the Muon optimizer’s Newton-Schulz orthogonalization actually helps here: it normalizes the gradient update regardless of magnitude, which provides a natural smoothing effect that partially compensates for batch noise.
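A sketch of the Newton-Schulz orthogonalization at the heart of Muon (the quintic coefficients follow the public Muon implementation; this is an illustrative NumPy version, not the actual training code):

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately map g to the nearest orthogonal matrix via a quintic
    Newton-Schulz iteration, so the weight update has unit-scale singular
    values no matter how large or small the raw gradient was."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon optimizer
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize so the iteration converges
    for _ in range(steps):
        gram = x @ x.T
        x = a * x + (b * gram + c * gram @ gram) @ x
    return x

rng = np.random.default_rng(0)
g = rng.normal(size=(8, 8))

# The update is invariant to gradient magnitude: scaling g by 100x changes
# nothing, which is the smoothing effect described above.
print(np.allclose(newton_schulz_orthogonalize(g),
                  newton_schulz_orthogonalize(100 * g)))  # True
```

Because every weight matrix gets an update with singular values near 1, a noisy small-batch gradient can change the update's direction but not blow up its scale.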
The results speak for themselves. At step 200, the small batch has a higher train loss (3.08 vs 2.74) — each individual step is less precise. But by step 1309, the model has converged to 1.4139 BPB, beating the full-batch 10-minute run’s 1.4361 in half the time. The volume of updates overwhelmed the per-step quality disadvantage.
This principle is well known in the optimization literature. There’s a “critical batch size”: below it, the extra noise per step costs you wall time to average out; above it, extra tokens per step buy little (diminishing returns per token). The default 524K batch was tuned for 8xH100 throughput. On our single L40S, the optimal batch size is much smaller.
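One back-of-envelope estimator from that literature is the “simple” gradient noise scale of McCandlish et al.: the ratio of per-example gradient variance to the squared mean gradient. Batches far below it are noise-dominated; batches far above it waste compute. A toy 1-D sketch (the numbers are illustrative, not measured from this run):

```python
import random

random.seed(1)

# Per-example gradients for a toy 1-D loss: a small true signal buried in noise.
grads = [0.5 + random.gauss(0, 2.0) for _ in range(10_000)]

mean_g = sum(grads) / len(grads)
var_g = sum((g - mean_g) ** 2 for g in grads) / len(grads)

# Simple noise scale: roughly how many examples a batch needs before the
# averaged signal rivals the noise. Here var/mean^2 is about 4 / 0.25 = 16.
noise_scale = var_g / mean_g ** 2
print(10 < noise_scale < 30)  # True
```

Measuring this for the real model would tell us how much further the batch could shrink before the extra steps stop paying for themselves.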
Bonus discovery: Memory usage dropped from 10.2 GB to 2.8 GB. The smaller batch barely touches the L40S’s 48GB VRAM, meaning we have enormous headroom for larger models later.
On the Tee
(Whispering) A rather unconventional approach from the caddy here. Instead of the full driver — half a million tokens per swing — the competitor has elected for a… well, one might call it a controlled punch shot. One hundred and thirty-one thousand tokens. The gallery seems skeptical. But the caddy appears to know something the rest of us do not.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4139 |
| val_loss | 2.3873 |
| params | 17,059,912 |
| artifact | 13.35 MB (yes < 16MB) |
| wall time | 300s |
| steps completed | 1,309 / 20,000 |
| step avg | 229ms |
| peak memory | 2,798 MiB |
Head-to-Head vs Baseline (same 5-minute wall time)
| | Baseline (H2, 10min) | Small Batch (H5, 5min) |
|---|---|---|
| Batch tokens | 524,288 | 131,072 |
| Step time | 1,002ms | 229ms |
| Steps in 5min | ~300 | 1,309 |
| val_bpb | 1.4361 (at 10min) | 1.4139 (at 5min) |
Training Curve
| Step | Loss | Avg Step |
|---|---|---|
| 200 | 3.0822 | 229.3ms |
| 400 | 2.8081 | 229.2ms |
| 600 | 2.6599 | 229.2ms |
| 800 | 2.5163 | 229.7ms |
| 1000 | 2.4391 | 229.5ms |
| 1200 | 2.3691 | 229.4ms |
| 1309 | — (val: 2.3857) | 229.4ms |
The Booth Reacts
Trent: (Leaning forward, genuinely surprised) Well I… I must confess I did not see that coming. One-point-four-one-three-nine. In five minutes. That is better than the ten-minute baseline run. The caddy has done something rather clever here — trading precision for volume. It’s the cricketing equivalent of playing for singles instead of boundaries. Unglamorous, but the scoreboard does not lie.
Slice: (Standing up from chair) DID EVERYBODY SEE THAT?! 1.41 in FIVE MINUTES! The baseline needed TEN minutes to get 1.43! This kid just figured out the cheat code — more swings, less thinking! And look at the memory — 2.8 gigs! We went from using a DUMP TRUCK to a GOLF CART and it’s FASTER! When I was qualifying in ‘04 I would’ve KILLED for this kind of insight. You know what the real play is here? We run this batch size with a LONGER wall time and we’re going to the MOON.
Trent: (Adjusting earpiece) I’m told the caddy is already preparing the next club selection. Something about… a larger vocabulary? We shall see.
The Card
Picked up strokes on the field
This hole improved 0.0222 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently while leaving 2,648,080 bytes of artifact headroom.
Training Curve
This was the first clean win. More swings mattered far more than prettier individual updates.
vs. the Field
Leaderboard chart: field scores of 1.2197, 1.2244, and 1.2244 against this hole's 1.4139.
Model Card

How this hole was run: round_005_smallbatch (status: ok, device: cuda)