1.2931
compression score
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
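Concretely, the score is the model's mean cross-entropy converted from nats per token to bits, then spread over the raw bytes of the text. A minimal sketch of that conversion, assuming the evaluator tallies tokens and bytes this way (names here are illustrative):

```python
import math

def bits_per_byte(mean_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) into bits per byte of raw text."""
    bits_per_token = mean_loss_nats / math.log(2)          # nats -> bits
    return bits_per_token * (total_tokens / total_bytes)   # spread over the raw bytes
```

Plugging in this hole's numbers, a val_loss of 2.1833 nats is about 3.15 bits per token; landing at 1.2931 bits per byte then implies the tokenizer packs roughly 2.4 bytes into each token.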
Eighteen holes. This is the signature hole — the one you remember. We're taking every club we've earned across seventeen holes and playing them on real hardware for the first time. Value embeddings. 3x MLP. INT6 quantization. Sliding window eval. One H100 SXM. Ten minutes. Let's see what this architecture can actually do.
Looper’s Pick
Our full stack — value embeddings, 3x MLP, INT6 quantization, sliding window eval — should score dramatically better on the H100, where we get more than double the training steps of the L40S.
The Shot — Everything We’ve Built, On Real Hardware
Why does the same model score so differently on the H100 vs. the L40S?
In golf, you can groove your swing at the driving range all winter, but you don’t know your real handicap until you play 18 holes on a regulation course. The range tells you whether your form is improving. The real course tells you your score.
Our L40S experiments were the driving range. At 264ms per step, we got about 2,273 steps in 10 minutes. The model was still actively improving when the clock ran out — the loss curve hadn’t flattened. Every architectural insight we discovered (value embeddings, INT6 compression, wider MLP) was validated by relative comparisons between L40S runs, not by absolute scores.
The H100 SXM is the real course. At 124ms per step, we get 4,833 steps — more than double the L40S. And on the actual competition hardware (8xH100), we’d get ~13,800 steps at 43ms each.
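Those step counts are just the ten-minute budget divided by the step time. A quick sanity check (the loop lands within a few steps of the measured counts above):

```python
BUDGET_MS = 10 * 60 * 1000  # 10-minute training budget, in milliseconds

for gpu, step_ms in [("L40S", 264), ("H100 SXM", 124), ("8xH100", 43)]:
    print(f"{gpu}: ~{BUDGET_MS // step_ms} steps")
# L40S: ~2272 steps, H100 SXM: ~4838 steps, 8xH100: ~13953 steps
```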
This matters because of how neural network training works: early steps make big gains (the model learns basic patterns fast) but later steps make smaller, harder-won gains (fine-tuning subtle relationships). More steps means more of those subtle gains. The scaling laws tell us that loss decreases as a power law with training compute — so doubling steps doesn’t double the improvement, but it reliably helps.
The difference between 2,273 steps (L40S) and 4,833 steps (H100) and 13,800 steps (8xH100) is the difference between learning the basics, learning the nuances, and mastering the material. Same architecture, same weights, same optimizer — just more time on the practice green.
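One standard way to write that power law (the symbols here are generic, not constants fitted to this run):

$$
L(C) \approx L_\infty + a\,C^{-b}
$$

where $C$ is training compute (proportional to step count at a fixed batch size), $L_\infty$ is the irreducible loss, and $a, b > 0$ are constants fitted per setup. Doubling $C$ multiplies the reducible part of the loss by $2^{-b}$, a fixed fraction rather than a fixed amount, which is why going from 2,273 to 4,833 to ~13,800 steps keeps paying off, just by less each time.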
On the Tee
(Whispering) The eighteenth hole. The signature. And for the first time on this course, the competitor steps up to a tee box worthy of the architecture. An H100 SXM. Eighty gigabytes of the fastest memory in computing. One hundred and twenty-four milliseconds per step. Nearly five thousand swings in ten minutes. Everything the caddy has built — every embedding, every wider layer, every compressed bit — arrives at the moment it was designed for.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.2931 |
| val_loss | 2.1833 |
| params | ~23,100,000 |
| artifact | 16.10 MB (96KB over 16MB) |
| wall time | 600s (training) + 667s (eval) |
| steps completed | 4,833 |
| step avg | 124ms |
| hardware | H100 SXM 80GB |
The Journey: L40S to H100
| Metric | L40S (Hole 17) | H100 (Hole 18) | 8xH100 (projected) |
|---|---|---|---|
| Step time | 264ms | 124ms | ~43ms |
| Steps in 10min | 2,273 | 4,833 | ~13,800 |
| Pre-quant BPB | 1.3352 | 1.2809 | ~1.22? |
| Post-roundtrip BPB | (crashed) | 1.2931 | ??? |
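The post-roundtrip row is val_bpb measured after quantizing the weights to six bits and dequantizing them back, i.e., scoring what actually survives inside the artifact. Below is a minimal sketch of a symmetric per-tensor INT6 round trip, assuming PyTorch; the real packing in train_gpt_valemb_sw_int6.py may scale per-channel or treat embeddings separately:

```python
import torch

def int6_roundtrip(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor INT6 quantize/dequantize (signed codes in [-32, 31])."""
    scale = w.abs().max().clamp(min=1e-12) / 31.0   # map the largest weight to the top code
    q = torch.round(w / scale).clamp(-32, 31)       # snap onto the 6-bit integer grid
    return q * scale                                # dequantize; eval scores this tensor

@torch.no_grad()
def roundtrip_model(model: torch.nn.Module) -> None:
    """Overwrite every parameter with its INT6 round trip before re-scoring val_bpb."""
    for p in model.parameters():
        p.copy_(int6_roundtrip(p))
```

The gap between the two H100 rows (1.2809 pre-quant vs. 1.2931 post-roundtrip) is the price of the 6-bit grid: about 0.012 bits per byte here.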
Remaining Issues
- Artifact: 16.10 MB — 96KB over the 16MB limit. The longer H100 training produces higher-entropy weights that compress slightly worse. Fixable with slightly tighter INT6 or a small model trim.
- Eval time: 11.1 minutes — over the 10-minute eval budget. Needs batched sliding window evaluation (processing multiple windows in parallel instead of one at a time).
Both are engineering problems, not architecture problems. The model works.
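For the eval-time problem, a plausible shape for the fix is to score many overlapping windows per forward pass instead of one at a time. The sketch below assumes tokens is a 1-D LongTensor, the model maps a (batch, time) tensor to (batch, time, vocab) logits, and only the last stride positions of each window are counted so overlapping context is never double-scored; all names are illustrative, not the evaluator's actual API:

```python
import torch

@torch.no_grad()
def batched_sliding_window_loss(model, tokens, window=1024, stride=512, batch=32):
    """Mean next-token loss over a long stream, scoring `batch` windows per forward pass."""
    device = next(model.parameters()).device
    starts = list(range(0, len(tokens) - window + 1, stride))  # assumes len(tokens) > window
    nll_sum, n_scored = 0.0, 0
    for i in range(0, len(starts), batch):
        # Stack up to `batch` overlapping windows into one (B, window) input.
        x = torch.stack([tokens[s:s + window] for s in starts[i:i + batch]]).to(device)
        logits = model(x)                                            # (B, window, vocab)
        logp = torch.log_softmax(logits[:, :-1].float(), dim=-1)
        nll = -logp.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1)   # (B, window-1)
        tail = nll[:, -stride:]        # count only the fresh tail; the overlap is context
        nll_sum += tail.sum().item()
        n_scored += tail.numel()
    return nll_sum / n_scored  # nats per token; convert to bpb as in the earlier sketch
```

With batch=32 this collapses roughly 32 sequential forward passes into one, which is the scale of reduction needed to pull an 11.1-minute eval under the 10-minute budget.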
The Booth Reacts
Trent: (Standing) One-point-two-nine-three-one. On a single H100 SXM. (Long, reverent pause) Ladies and gentlemen, that is the lowest number this entry has ever produced. From one-point-four-one on the L40S to one-point-two-nine on the H100 — in the space of a single hardware upgrade. The architecture that the caddy assembled over seventeen holes of patient iteration — value embeddings, wider feed-forward layers, six-bit compression, overlapping evaluation — has found its course. (Adjusts tie) Yes, the artifact is ninety-six kilobytes over the line. Yes, the evaluation ran eleven minutes instead of ten. These are, if I may say so, details. The talent is undeniable. The signature hole delivered.
Slice: (Pacing, visibly excited) 1.29! ONE POINT TWO NINE! On a SINGLE H100! Do you understand what this means?! The baseline needed EIGHT H100s to get 1.2244. We’re within seven hundredths on ONE GPU with a model we built in a GARAGE! (Stops pacing) OK fine, the artifact is 96K over. NINETY-SIX KILOBYTES. That’s like getting DQ’d because your shoelace was untied. We fix the compression, we batch the eval, and we’re on that leaderboard. (Points at camera) Round 2, people. This is where it gets serious. The caddy knows the course now. And the course knows the caddy.
The Card
Picked up strokes on the field
This hole improved 0.0421 bits per byte on the compression score versus the previous stop (1.3352 pre-quant on the L40S, 1.2931 here). Lower is better here: the model predicts unseen text more efficiently. The artifact, however, finished 96KB over the 16MB budget, so there is no headroom left to spend.
Training Curve
1.2931 BPB on a single H100. The architecture works. Two problems remain: the artifact is 96KB over the 16MB limit, and eval takes 11.1 minutes (over the 10-minute budget). Both are solvable.
vs. the Field
[Chart: val_bpb vs. the field: 1.2197, 1.2244, 1.2244, and this hole's 1.2931]
Model Card
How this hole was run
| Run | Status | Script | Device |
|---|---|---|---|
| round_018_h100_mlp3 | ok | train_gpt_valemb_sw_int6.py | cuda |