1.2931
compression score
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
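Concretely, the score is the model's mean cross-entropy converted from nats per token to bits, then spread over the raw bytes of the text. A minimal sketch of that conversion, assuming the evaluator tallies tokens and bytes this way (names here are illustrative):

```python
import math

def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) into bits per byte of raw text."""
    bits_per_token = mean_loss_nats / math.log(2)          # nats -> bits
    return bits_per_token * (total_tokens / total_bytes)   # spread over the raw bytes
```

Plugging in this hole's numbers, a val_loss of 2.1833 nats is about 3.15 bits per token; landing at 1.2931 bits per byte then implies the tokenizer packs roughly 2.4 bytes into each token.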
Eighteen holes. This is the signature hole — the one you remember. We're taking every club we've earned across seventeen holes and playing them on real hardware for the first time. Value embeddings. 3x MLP. INT6 quantization. Sliding window eval. One H100 SXM. Ten minutes. Let's see what this architecture can actually do.
Looper’s Pick
Our full stack — value embeddings, 3x MLP, INT6 quantization, sliding window eval — should score dramatically better on the H100, where we get more than double the training steps of the L40S.
The Shot — Everything We’ve Built, On Real Hardware
Why does the same model score so differently on the H100 vs. the L40S?
In golf, you can groove your swing at the driving range all winter, but you don’t know your real handicap until you play 18 holes on a regulation course. The range tells you whether your form is improving. The real course tells you your score.
Our L40S experiments were the driving range. At 264ms per step, we got about 2,273 steps in 10 minutes. The model was still actively improving when the clock ran out — the loss curve hadn’t flattened. Every architectural insight we discovered (value embeddings, INT6 compression, wider MLP) was validated by relative comparisons between L40S runs, not by absolute scores.
The H100 SXM is the real course. At 124ms per step, we get 4,833 steps — more than double the L40S. And on the actual competition hardware (8xH100), we’d get ~13,800 steps at 43ms each.
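Those step counts are just the ten-minute budget divided by the step time. A quick sanity check (the loop lands within a few steps of the measured counts above):

```python
BUDGET_MS = 10 * 60 * 1000  # 10-minute training budget, in milliseconds

for gpu, step_ms in [("L40S", 264), ("H100 SXM", 124), ("8xH100", 43)]:
    print(f"{gpu}: ~{BUDGET_MS // step_ms} steps")
# L40S: ~2272 steps, H100 SXM: ~4838 steps, 8xH100: ~13953 steps
```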
This matters because of how neural network training works: early steps make big gains (the model learns basic patterns fast) but later steps make smaller, harder-won gains (fine-tuning subtle relationships). More steps means more of those subtle gains. The scaling laws tell us that loss decreases as a power law with training compute — so doubling steps doesn’t double the improvement, but it reliably helps.
The difference between 2,273 steps (L40S) and 4,833 steps (H100) and 13,800 steps (8xH100) is the difference between learning the basics, learning the nuances, and mastering the material. Same architecture, same weights, same optimizer — just more time on the practice green.
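One standard way to write that power law (the symbols here are generic, not constants fitted to this run):

$$
L(C) \approx L_\infty + a\,C^{-b}
$$

where $C$ is training compute (proportional to step count at a fixed batch size), $L_\infty$ is the irreducible loss, and $a, b > 0$ are constants fitted per setup. Doubling $C$ multiplies the reducible part of the loss by $2^{-b}$, a fixed fraction rather than a fixed amount, which is why going from 2,273 to 4,833 to ~13,800 steps keeps paying off, just by less each time.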
On the Tee
(Whispering) The eighteenth hole. The signature. And for the first time on this course, the competitor steps up to a tee box worthy of the architecture. An H100 SXM. Eighty gigabytes of the fastest memory in computing. One hundred and twenty-four milliseconds per step. Nearly five thousand swings in ten minutes. Everything the caddy has built — every embedding, every wider layer, every compressed bit — arrives at the moment it was designed for.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.2931 |
| val_loss | 2.1833 |
| params | ~23,100,000 |
| artifact | 16.10 MB (96KB over 16MB) |
| wall time | 600s (training) + 667s (eval) |
| steps completed | 4,833 |
| step avg | 124ms |
| hardware | H100 SXM 80GB |
The Journey: L40S to H100
| Metric | L40S (Hole 17) | H100 (Hole 18) | 8xH100 (projected) |
|---|---|---|---|
| Step time | 264ms | 124ms | ~43ms |
| Steps in 10min | 2,273 | 4,833 | ~13,800 |
| Pre-quant BPB | 1.3352 | 1.2809 | ~1.22? |
| Post-roundtrip BPB | (crashed) | 1.2931 | ??? |
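The post-roundtrip row is val_bpb measured after quantizing the weights to six bits and dequantizing them back, i.e., scoring what actually survives inside the artifact. Below is a minimal sketch of a symmetric per-tensor INT6 round trip, assuming PyTorch; the real packing in train_gpt_valemb_sw_int6.py may scale per-channel or treat embeddings separately:

```python
import torch

def int6_roundtrip(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor INT6 quantize/dequantize (signed codes in [-32, 31])."""
    scale = w.abs().max().clamp(min=1e-12) / 31.0   # map the largest weight to the top code
    q = torch.round(w / scale).clamp(-32, 31)       # snap onto the 6-bit integer grid
    return q * scale                                # dequantize; eval scores this tensor

@torch.no_grad()
def roundtrip_model(model: torch.nn.Module) -> None:
    """Overwrite every parameter with its INT6 round trip before re-scoring val_bpb."""
    for p in model.parameters():
        p.copy_(int6_roundtrip(p))
```

The gap between the two H100 rows (1.2809 pre-quant vs. 1.2931 post-roundtrip) is the price of the 6-bit grid: about 0.012 bits per byte here.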
Remaining Issues
- Artifact: 16.10 MB — 96KB over the 16MB limit. The longer H100 training produces higher-entropy weights that compress slightly worse. Fixable with slightly tighter INT6 or a small model trim.
- Eval time: 11.1 minutes — over the 10-minute eval budget. Needs batched sliding window evaluation (processing multiple windows in parallel instead of one at a time).
Both are engineering problems, not architecture problems. The model works.
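For the eval-time problem, a plausible shape for the fix is to score many overlapping windows per forward pass instead of one at a time. The sketch below assumes tokens is a 1-D LongTensor, the model maps a (batch, time) tensor to (batch, time, vocab) logits, and only the last stride positions of each window are counted so overlapping context is never double-scored; all names are illustrative, not the evaluator's actual API:

```python
import torch

@torch.no_grad()
def batched_sliding_window_loss(model, tokens, window=1024, stride=512, batch=32):
    """Mean next-token loss over a long stream, scoring `batch` windows per forward pass."""
    device = next(model.parameters()).device
    starts = list(range(0, len(tokens) - window + 1, stride))  # assumes len(tokens) > window
    nll_sum, n_scored = 0.0, 0
    for i in range(0, len(starts), batch):
        # Stack up to `batch` overlapping windows into one (B, window) input.
        x = torch.stack([tokens[s:s + window] for s in starts[i:i + batch]]).to(device)
        logits = model(x)                                            # (B, window, vocab)
        logp = torch.log_softmax(logits[:, :-1].float(), dim=-1)
        nll = -logp.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1)   # (B, window-1)
        tail = nll[:, -stride:]        # count only the fresh tail; the overlap is context
        nll_sum += tail.sum().item()
        n_scored += tail.numel()
    return nll_sum / n_scored  # nats per token; convert to bpb as in the earlier sketch
```

With batch=32 this collapses roughly 32 sequential forward passes into one, which is the scale of reduction needed to pull an 11.1-minute eval under the 10-minute budget.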
The Booth Reacts
Trent: (Standing) One-point-two-nine-three-one. On a single H100 SXM. (Long, reverent pause) Ladies and gentlemen, that is the lowest number this entry has ever produced. From one-point-four-one on the L40S to one-point-two-nine on the H100 — in the space of a single hardware upgrade. The architecture that the caddy assembled over seventeen holes of patient iteration — value embeddings, wider feed-forward layers, six-bit compression, overlapping evaluation — has found its course. (Adjusts tie) Yes, the artifact is ninety-six kilobytes over the line. Yes, the evaluation ran eleven minutes instead of ten. These are, if I may say so, details. The talent is undeniable. The signature hole delivered.
Slice: (Pacing, visibly excited) 1.29! ONE POINT TWO NINE! On a SINGLE H100! Do you understand what this means?! The baseline needed EIGHT H100s to get 1.2244. We’re within seven hundredths on ONE GPU with a model we built in a GARAGE! (Stops pacing) OK fine, the artifact is 96K over. NINETY-SIX KILOBYTES. That’s like getting DQ’d because your shoelace was untied. We fix the compression, we batch the eval, and we’re on that leaderboard. (Points at camera) Round 2, people. This is where it gets serious. The caddy knows the course now. And the course knows the caddy.
The Card
Picked up strokes on the field
This hole improved 0.0421 bits per byte on the compression score versus the previous stop (1.3352 pre-quant on the L40S, 1.2931 here). Lower is better here: the model predicts unseen text more efficiently. The artifact, however, finished 96KB over the 16MB budget, so there is no headroom left to spend.
Training Curve
1.2931 BPB on a single H100. The architecture works. Two problems remain: the artifact is 96KB over the 16MB limit, and eval takes 11.1 minutes (over the 10-minute budget). Both are solvable.
vs. the Field
[Chart: val_bpb vs. the field: 1.2197, 1.2244, 1.2244, and this hole's 1.2931]
Model Card
How this hole was run
| Run | Status | Script | Device |
|---|---|---|---|
| round_018_h100_mlp3 | ok | train_gpt_valemb_sw_int6.py | cuda |