Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

Round 1, Hole 18 · Bogey · March 19, 2026

1.2931

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.0687 · vs last hole: −0.0421
Tee Box: R1 · H18
Artifact: 16.10 MB
Headroom: none (96 KB over the 16 MB limit)
Tempo: 124 ms · 4,833 steps
Looper · The Caddie · Safe Tweaks

Technical Read

Our full stack — value embeddings, 3x MLP, INT6, sliding window — should score dramatically better on H100, with more than twice the training steps we got on the L40S.


Looper’s Pick

Eighteen holes. This is the signature hole — the one you remember. We’re taking every club we’ve earned across seventeen holes and playing them on real hardware for the first time. Value embeddings. 3x MLP. INT6 quantization. Sliding window eval. One H100 SXM. Ten minutes. Let’s see what this architecture can actually do.

The Shot — Everything We’ve Built, On Real Hardware

Why does the same model score so differently on H100 vs L40S?

In golf, you can practice your swing on the driving range all winter, but you don't know your real handicap until you play 18 holes on a regulation course. The range tells you if your form is improving. The real course tells you your score.

Our L40S experiments were the driving range. At 264ms per step, we got about 2,273 steps in 10 minutes. The model was still actively improving when the clock ran out — the loss curve hadn't flattened. Every architectural insight we discovered (value embeddings, INT6 compression, wider MLP) was validated by relative comparisons between L40S runs, not by absolute scores.

The H100 SXM is the real course. At 124ms per step, we get 4,833 steps — more than double the L40S. And on the actual competition hardware (8xH100), we’d get ~13,800 steps at 43ms each.

This matters because of how neural network training works: early steps make big gains (the model learns basic patterns fast) but later steps make smaller, harder-won gains (fine-tuning subtle relationships). More steps means more of those subtle gains. The scaling laws tell us that loss decreases as a power law with training compute — so doubling steps doesn’t double the improvement, but it reliably helps.
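To make that concrete, here is a minimal sketch of the power-law intuition in Python. The constants A, B, C are purely illustrative and are not fitted to this entry's runs; only the per-step times and the 600-second budget come from the scorecard.

```python
# Hypothetical power law: loss(steps) = A * steps**(-B) + C, where C is an
# irreducible floor. A, B, C are made-up constants for illustration only;
# they are NOT fitted to this entry's actual training curves.
A, B, C = 6.0, 0.25, 1.0

def projected_loss(steps: int) -> float:
    return A * steps ** (-B) + C

# Step counts fall out of the 600-second budget divided by per-step time.
for gpu, step_ms in [("L40S", 264), ("H100 SXM", 124), ("8xH100", 43)]:
    steps = 600_000 // step_ms
    print(f"{gpu:>8}: {steps:>6,} steps -> projected loss {projected_loss(steps):.3f}")
```

Each doubling of steps buys a smaller absolute gain than the last, which is exactly why the jump from 2,273 to 4,833 steps helps a lot but doesn't close the whole gap.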

The difference between 2,273 steps (L40S), 4,833 steps (H100), and 13,800 steps (8xH100) is the difference between learning the basics, learning the nuances, and mastering the material. Same architecture, same weights, same optimizer — just more time on the practice green.

On the Tee

(Whispering) The eighteenth hole. The signature. And for the first time on this course, the competitor steps up to a tee box worthy of the architecture. An H100 SXM. Eighty gigabytes of the fastest memory in computing. One hundred and twenty-four milliseconds per step. Nearly five thousand swings in ten minutes. Everything the caddie has built — every embedding, every wider layer, every compressed bit — arrives at the moment it was designed for.

Results

Metric            Value
val_bpb           1.2931
val_loss          2.1833
params            ~23,100,000
artifact          16.10 MB (96 KB over the 16 MB limit)
wall time         600 s training + 667 s eval
steps completed   4,833
step avg          124 ms
hardware          H100 SXM 80 GB

The Journey: L40S to H100

                     L40S (Hole 17)   H100 (Hole 18)   8xH100 (projected)
Step time            264 ms           124 ms           ~43 ms
Steps in 10 min      2,273            4,833            ~13,800
Pre-quant BPB        1.3352           1.2809           ~1.22?
Post-roundtrip BPB   (crashed)        1.2931           ???

Remaining Issues

  1. Artifact: 16.10 MB — 96 KB over the 16 MB limit. The longer H100 training produces higher-entropy weights that compress slightly worse. Fixable with slightly tighter INT6 packing or a small model trim; a sketch of the packing idea follows this section.
  2. Eval time: 11.1 minutes — over the 10-minute eval budget. Needs batched sliding-window evaluation, processing multiple windows in parallel instead of one at a time, as sketched below.
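Here's a minimal sketch of what batched sliding-window evaluation could look like, assuming a standard PyTorch model(input_ids) -> logits interface. The function name and defaults are hypothetical (the stride mirrors the VAL_SW_STRIDE=256 override on the model card), not the entry's actual eval code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def batched_sliding_window_loss(model, tokens, window=1024, stride=256, batch_size=32):
    # tokens: 1-D LongTensor of token ids for the whole validation set.
    # Each window starts `stride` tokens after the previous one, and only its
    # last `stride` targets are scored, so every scored token keeps at least
    # (window - stride) tokens of left context.
    starts = list(range(0, tokens.numel() - window, stride))
    total_nll, total_tokens = 0.0, 0
    for i in range(0, len(starts), batch_size):
        chunk = starts[i:i + batch_size]
        # The fix: stack many windows into one batch instead of looping one by one.
        x = torch.stack([tokens[s:s + window] for s in chunk])          # (B, W)
        y = torch.stack([tokens[s + 1:s + window + 1] for s in chunk])  # (B, W)
        logits = model(x)                                               # (B, W, V)
        nll = F.cross_entropy(
            logits[:, -stride:].reshape(-1, logits.size(-1)),
            y[:, -stride:].reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += len(chunk) * stride
    return total_nll / total_tokens  # mean nats per scored token
```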

Both are engineering problems, not architecture problems. The model works.
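And a minimal sketch of the INT6 round-trip from issue 1: symmetric 6-bit quantization, four values packed into three bytes, then zlib. Assumes NumPy; this illustrates the general technique, not the entry's actual packer.

```python
import zlib
import numpy as np

def quantize_int6(w):
    # Symmetric 6-bit quantization: floats -> integers in [-31, 31] plus a scale.
    scale = max(float(np.abs(w).max()), 1e-8) / 31.0
    q = np.clip(np.round(w / scale), -31, 31).astype(np.int8)
    return q, scale

def pack_int6(q):
    # Offset to unsigned 0..63, then pack four 6-bit values into three bytes.
    u = (q.astype(np.int16) + 32).astype(np.uint8).ravel()
    u = np.concatenate([u, np.zeros((-len(u)) % 4, dtype=np.uint8)])  # pad to a multiple of 4
    a, b, c, d = u[0::4], u[1::4], u[2::4], u[3::4]
    packed = np.empty(3 * len(a), dtype=np.uint8)
    packed[0::3] = (a << 2) | (b >> 4)
    packed[1::3] = ((b & 0x0F) << 4) | (c >> 2)
    packed[2::3] = ((c & 0x03) << 6) | d
    return packed.tobytes()

w = np.random.randn(4096).astype(np.float32)   # stand-in for real weights
q, scale = quantize_int6(w)
blob = zlib.compress(pack_int6(q), level=9)
print(f"{w.nbytes} float bytes -> {len(blob)} packed+zipped bytes")
```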

The Booth Reacts

Trent Fairway: (Standing) One-point-two-nine-three-one. On a single H100 SXM. (Long, reverent pause) Ladies and gentlemen, that is the lowest number this entry has ever produced. From one-point-four-one on the L40S to one-point-two-nine on the H100 — in the space of a single hardware upgrade. The architecture that the caddie assembled over seventeen holes of patient iteration — value embeddings, wider feed-forward layers, six-bit compression, overlapping evaluation — has found its course. (Adjusts tie) Yes, the artifact is ninety-six kilobytes over the line. Yes, the evaluation ran eleven minutes instead of ten. These are, if I may say so, details. The talent is undeniable. The signature hole delivered.

Slice Shanksalot: (Pacing, visibly excited) 1.29! ONE POINT TWO NINE! On a SINGLE H100! Do you understand what this means?! The baseline needed EIGHT H100s to get 1.2244. We're within seven hundredths on ONE GPU with a model we built in a GARAGE! (Stops pacing) OK fine, the artifact is 96K over. NINETY-SIX KILOBYTES. That's like getting DQ'd because your shoelace was untied. We fix the compression, we batch the eval, and we're on that leaderboard. (Points at camera) Round 2, people. This is where it gets serious. The caddie knows the course now. And the course knows the caddie.


The Card

Scorecard
Result: Encouraging Miss

Picked up strokes on the field

This hole improved 0.0421 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently. The catch is the artifact, which finished 96 KB over the 16 MB limit, with no headroom to spare.

val_bpb 1.2931 (+0.0687 vs baseline) · Bits per byte, the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

val_loss 2.1833 · Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits per byte, related by: BPB = (val_loss / ln 2) × (tokens / bytes). Lower is better.

params ~23,100,000 · Total trainable parameters. The number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16 MB limit constrains how many parameters you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

artifact 16.10 MB · Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via quantization (INT8 in the baseline; INT6 for this entry) and zlib. Must be under 16,000,000 bytes (decimal 16 MB). The model is decompressed and dequantized before evaluation.

wall time 600 s · Training wall clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.

steps 4,833 · Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall clock limit.

step avg 124 ms · Average time per training step. How long each gradient update takes, in milliseconds. Faster steps mean more training in the same wall clock budget. Affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5 ms; our L40S averages ~230-1000 ms depending on batch size.
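To make the val_loss to BPB conversion concrete, here is the arithmetic run on this hole's own numbers; the tokens-per-byte ratio is backed out from them rather than separately reported.

```python
import math

val_loss = 2.1833   # nats per token, from the results table
val_bpb = 1.2931    # bits per byte, from the results table

# BPB = (val_loss / ln 2) * (tokens / bytes)
bits_per_token = val_loss / math.log(2)      # ~3.150 bits per token
tokens_per_byte = val_bpb / bits_per_token   # ~0.411 tokens per byte
print(f"{bits_per_token:.3f} bits/token")
print(f"{tokens_per_byte:.3f} tokens/byte (~{1 / tokens_per_byte:.2f} bytes/token)")
```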

Training Curve

[Training curve: train_loss vs. step, this hole vs. Baseline (R2). Final train_loss: 2.1673.]
Post-Round Lesson

1.2931 BPB on a single H100. The architecture works. Two problems remain: artifact is 96KB over 16MB, and eval takes 11 minutes (over 10-min budget). Both are solvable.

vs. the Field

vs SOTA (1.2197): +0.0734
vs Baseline (1.2244): +0.0687
vs Our Best (1.2244): +0.0687
This Hole: 1.2931 · lower is better


Model Card

How this hole was run

Run ID round_018_h100_mlp3
Status ok
Training Script train_gpt_valemb_sw_int6.py
Backend cuda
Key Overrides
MLP_MULT=3 · TRAIN_BATCH_TOKENS=131072 · VAL_SW_STRIDE=256 · QUANT_BITS=6 · MAX_WALLCLOCK_SECONDS=600
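The override names read like environment variables; here's a minimal sketch of how a script might consume them. Both the env-var mechanism and the default values below are assumptions for illustration; the actual plumbing inside train_gpt_valemb_sw_int6.py isn't shown on this card.

```python
import os

def env_int(name: str, default: int) -> int:
    # Read an integer override from the environment, else use the default.
    return int(os.environ.get(name, default))

# Defaults below are hypothetical placeholders, not the script's real ones.
MLP_MULT = env_int("MLP_MULT", 4)                            # MLP width multiplier
TRAIN_BATCH_TOKENS = env_int("TRAIN_BATCH_TOKENS", 65536)    # tokens per step
VAL_SW_STRIDE = env_int("VAL_SW_STRIDE", 1024)               # sliding-window stride
QUANT_BITS = env_int("QUANT_BITS", 8)                        # weight quantization width
MAX_WALLCLOCK_SECONDS = env_int("MAX_WALLCLOCK_SECONDS", 600)
```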
← Back Up The Fairway: Round 1, Hole 17 · The Big Iron