1.3394
compression score
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
The Hole 10 architecture was still improving when the clock ran out. Doubling the wall time should show how much runway it has.
Looper’s Pick
The curve was still falling when Hole 10 hit the clock. Before we bolt on more architecture, let’s just finish the swing — same clubs, double the time. Ten minutes instead of five. If the loss keeps dropping, the bottleneck is step budget. If it plateaus, we need a different club.
The Shot — More Steps, Same Architecture
Why is "just train longer" a useful experiment?
In golf, sometimes the right call isn’t a new club — it’s a longer backswing. You had the right form all along; you just weren’t following through.
When we ran Hole 10 for 5 minutes (~1,274 steps), the training loss was still visibly decreasing at the end. This means the model hadn’t converged — it was still learning useful patterns from the data. Cutting it off at 5 minutes was an artificial constraint we imposed for fast iteration, not because the model was done.
By doubling to 10 minutes (~2,550 steps), we answer a critical question: is this architecture step-limited or capacity-limited? If the loss keeps dropping linearly, we’re step-limited — the model has more to learn and just needs more time. If it flattens out, we’re capacity-limited — the model’s 18M parameters have absorbed all they can and we need a bigger or different architecture.
The answer matters for strategy. If step-limited: focus on throughput (faster steps, better batch size, more efficient compute). If capacity-limited: focus on architecture (more params via compression, depth recurrence, etc.).
One caveat: on 8xH100, 10 minutes gives ~13,800 steps at 43ms each. Our L40S at 235ms/step only gets ~2,550 — still far short. So even if this run plateaus, the H100 might not. We’re testing the shape of the curve, not the final score.
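The step-limited vs capacity-limited question above boils down to a slope check on the tail of the validation curve. A minimal sketch of that check; the function name, the `flat_tol` threshold, and the loss values are illustrative assumptions, not values from the actual training script:

```python
# Sketch: classify a run as step-limited vs capacity-limited from the
# tail of its validation-loss curve. Threshold is an illustrative guess.

def runway_verdict(val_losses, tail=5, flat_tol=1e-3):
    """Average per-eval improvement over the last `tail` points.

    If the loss is still dropping faster than `flat_tol` per eval,
    call the run step-limited; otherwise capacity-limited.
    """
    tail_pts = val_losses[-tail:]
    drop_per_eval = (tail_pts[0] - tail_pts[-1]) / (len(tail_pts) - 1)
    return "step-limited" if drop_per_eval > flat_tol else "capacity-limited"

# A curve still visibly falling at cutoff, like Hole 10's:
print(runway_verdict([2.45, 2.38, 2.33, 2.30, 2.27, 2.26]))  # step-limited
```

The same check run on a flat tail returns "capacity-limited", which is the signal to reach for a different club rather than more clock.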
On the Tee
(Whispering) No new clubs today. The same value embeddings, the same architecture — but double the time on the clock. Ten minutes. Twenty-five hundred steps. The competitor is, in essence, wagering that the model they built in Hole 10 had more to say. The gallery settles in for a longer watch.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.3394 |
| val_loss | 2.2615 |
| params | ~18,380,000 |
| artifact | 16.72 MB (OVER 16MB limit) |
| wall time | 600s |
| steps completed | ~2,552 |
| step avg | 235ms |
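On the card, val_loss is per-token cross-entropy in nats while val_bpb is bits per byte; the bridge between them is the tokenizer's average bytes per token. A sketch of the conversion, where the bytes-per-token figure is simply backed out of the two reported numbers (an implied value, not one taken from the training script):

```python
import math

# Sketch: convert per-token cross-entropy (nats) to bits per byte.
def nats_per_token_to_bpb(val_loss, bytes_per_token):
    bits_per_token = val_loss / math.log(2)   # nats -> bits
    return bits_per_token / bytes_per_token   # per token -> per byte

# Implied by this hole's card (val_loss 2.2615, val_bpb 1.3394): ~2.44
bytes_per_token = 2.2615 / (math.log(2) * 1.3394)
print(round(nats_per_token_to_bpb(2.2615, bytes_per_token), 4))  # 1.3394
```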
The Runway Test
| Steps | val_bpb | Delta from 5-min run |
|---|---|---|
| ~1,274 (5 min, Hole 10) | 1.4057 | — |
| ~2,552 (10 min, this hole) | 1.3394 | -0.0663 |
Doubling the step count improved BPB by 0.0663: the model was deeply step-limited, and the curve was nowhere near a plateau.
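One way to gauge the remaining runway is a naive power-law fit through the two measured points. With only two points the fit is pinned exactly, so treat the projection at the H100 step budget as a back-of-envelope, not a prediction:

```python
import math

# Sketch: power-law fit bpb = coef * steps**expo through two points,
# then extrapolate to the ~13,800-step budget an 8xH100 would allow.
def fit_power_law(s1, b1, s2, b2):
    expo = math.log(b2 / b1) / math.log(s2 / s1)
    coef = b1 / s1**expo
    return lambda steps: coef * steps**expo

project = fit_power_law(1274, 1.4057, 2552, 1.3394)
print(round(project(13800), 4))  # optimistic H100-budget projection
```

The projection lands well below this hole's 1.3394, which is consistent with the caveat above: even if an L40S run eventually plateaus, the H100 budget might not.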
The Problem
The artifact is 16.72 MB, roughly 720KB over the 16MB competition limit. The value embeddings add ~1.3M params that compress less well than the main weight matrices. We need at least one of the following:
- Slim the value embeddings (fewer tables or smaller dimension)
- Apply the fp16 embedding export fix (saves ~500KB)
- Use QAT for better compression
- Reduce another part of the model to make room
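A quick budget check on those options, using the 16.72 MB artifact and the ~500KB estimate for the fp16 export fix (the extra slimming figure is a placeholder assumption). Notably, by these numbers the fp16 export alone still leaves the artifact over the limit, so the options likely need combining:

```python
# Sketch: which trimming options clear the ~720KB overage?
# Savings figures: fp16 export from the estimate above; the
# value-embedding slim is a placeholder assumption.
LIMIT_MB = 16.0
ARTIFACT_MB = 16.72

options = {
    "fp16 embedding export alone": 0.50,
    "fp16 export + slim value-embedding tables": 0.50 + 0.40,
}

for name, saved_mb in options.items():
    size = ARTIFACT_MB - saved_mb
    verdict = "fits" if size <= LIMIT_MB else "still over"
    print(f"{name}: {size:.2f} MB -> {verdict}")
```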
The Booth Reacts
Trent: (Eyes widening slightly) One-point-three-three-nine-four. That is a staggering improvement — sixty-six thousandths better than Hole 10, in double the time. The model was, as the caddy suspected, nowhere near finished. The loss curve descended steadily through twenty-five hundred steps with no sign of flattening. On 8xH100, one imagines it would continue for thirteen thousand steps more. (Pause) However. The artifact. Sixteen-point-seven megabytes. Over the limit. A magnificent drive… into the out-of-bounds.
Slice: OK so we just found GOLD but the suitcase is too big for the overhead bin! 1.3394, people! That’s the best number we’ve EVER seen and it’s not even on the real hardware! But we can’t submit it because the model is 700K too fat. (Pacing) This is like qualifying in ‘04 when I shot a 65 but got DQ’d for signing the wrong scorecard. The TALENT is there. The EXECUTION needs work. We’ve got to trim this thing or compress it better. The fp16 embedding fix is sitting right there — that’s 500KB back. DO IT.
The Card
Picked up strokes on the field
This hole improved 0.0685 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently. The asterisk: the artifact exceeds the 16MB limit, so there is no headroom left at all, and the score cannot be submitted as-is.
Training Curve
The architecture has massive headroom — 0.066 BPB improvement with double the steps. BUT the artifact hit 16.7MB, over the 16MB limit. Need to either slim the value embeddings or improve compression.
vs. the Field
Field: 1.2197, 1.2244, 1.2244. This hole: 1.3394.
Model Card
How this hole was run
| run | status | script | device |
|---|---|---|---|
| round_012_valemb_10min | ok | train_gpt_valemb.py | cuda |