Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

Round 1, Hole 12 · Bogey · March 19, 2026

1.3394

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
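A minimal sketch of that conversion, using the BPB formula from the scorecard glossary further down. The token and byte counts here are made-up stand-ins chosen to reproduce this hole's number, not the actual validation set sizes:

```python
import math

def bits_per_byte(val_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """BPB = (val_loss / ln 2) * (tokens / bytes): nats/token -> bits/byte."""
    return (val_loss_nats / math.log(2)) * (n_tokens / n_bytes)

# Hypothetical counts (~0.41 tokens per byte) chosen to match this hole's card:
print(bits_per_byte(2.2615, n_tokens=410_520, n_bytes=1_000_000))  # ~1.3394
```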

vs baseline: +0.1150 · vs last hole: -0.0685
Tee Box: R1 · H12
Artifact: 16.72 MB
Headroom: 0.00 MB (room left under the 16 MB cap)
Tempo: 235 ms · 2,552 steps
Looper the Caddie · Safe Tweaks

The curve was still falling when Hole 10 hit the clock. Before we bolt on more architecture, let's just finish the swing — same clubs, double the time. Ten minutes instead of five. If the loss keeps dropping, the bottleneck is step budget. If it plateaus, we need a different club.

Technical Read

The Hole 10 architecture was still improving when the clock ran out. Doubling the wall time should show how much runway it has.

Trent Fairway · On the Tee

(Whispering) No new clubs today. The same value embeddings, the same architecture — but double the time on the clock. Ten minutes. Twenty-five hundred steps. The competitor is, in essence, wagering that the model they built in Hole 10 had more to say. The gallery settles in for a longer watch.


The Shot — More Steps, Same Architecture

Why is "just train longer" a useful experiment?

In golf, sometimes the right call isn’t a new club — it’s a longer backswing. You had the right form all along; you just weren’t following through.

When we ran Hole 10 for 5 minutes (~1,274 steps), the training loss was still visibly decreasing at the end. This means the model hadn’t converged — it was still learning useful patterns from the data. Cutting it off at 5 minutes was an artificial constraint we imposed for fast iteration, not because the model was done.

By doubling to 10 minutes (~2,550 steps), we answer a critical question: is this architecture step-limited or capacity-limited? If the loss keeps dropping linearly, we’re step-limited — the model has more to learn and just needs more time. If it flattens out, we’re capacity-limited — the model’s 18M parameters have absorbed all they can and we need a bigger or different architecture.

The answer matters for strategy. If step-limited: focus on throughput (faster steps, better batch size, more efficient compute). If capacity-limited: focus on architecture (more params via compression, depth recurrence, etc.).
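Below is a caddie-book sketch of how one might make that call programmatically from the logged losses. The `losses` list, the tail fraction, and the flatness threshold are all assumptions for illustration, not values from the actual run:

```python
import numpy as np

def runway_check(losses, tail_frac=0.2, flat_slope=-1e-4):
    """Fit a line to the tail of the loss curve and read off the slope.

    Clearly negative slope -> step-limited: keep training.
    Near-zero slope        -> capacity-limited: change the architecture.
    tail_frac and flat_slope are judgment calls, not challenge constants.
    """
    tail = np.asarray(losses[-max(2, int(len(losses) * tail_frac)):], dtype=float)
    slope = np.polyfit(np.arange(len(tail)), tail, deg=1)[0]
    return "step-limited" if slope < flat_slope else "capacity-limited"
```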

One caveat: on 8xH100, 10 minutes gives ~13,800 steps at 43.5ms each. Our L40S at 235ms/step only gets ~2,550 — still far short. So even if this run plateaus, the H100 might not. We’re testing the shape of the curve, not the final score.
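Those step budgets are straight division of wall clock by step time; a two-liner (step times from the scorecard glossary below) makes the gap concrete:

```python
# steps = wall clock / step time
for gpu, step_ms in [("8xH100", 43.5), ("L40S", 235)]:
    print(f"{gpu}: ~{600 / (step_ms / 1000):,.0f} steps in 10 minutes")
# 8xH100: ~13,793 steps; L40S: ~2,553 steps
```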


Results

Metric           Value
val_bpb          1.3394
val_loss         2.2615
params           ~18,380,000
artifact         16.72 MB (OVER the 16 MB limit)
wall time        600s
steps completed  ~2,552
step avg         235ms

The Runway Test

Steps                        val_bpb    Delta from 5-min run
~1,274 (5 min, Hole 10)      1.4057
~2,552 (10 min, this hole)   1.3394     -0.0663

Doubling steps improved BPB by 0.066 — the model was deeply step-limited. The curve was nowhere near plateau.

The Problem

The artifact is 16.72 MB — 720KB over the 16MB competition limit. The value embeddings add ~1.3M params that compress less well than the main weight matrices. We need to either slim the value embeddings or compress the artifact harder; the fp16 embedding fix alone would claw back roughly 500KB.
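For scale, here is a rough, hypothetical estimator of the compressed artifact size, mirroring the card's description of the submission pipeline (code bytes plus zlib-compressed INT8 weights). The per-tensor symmetric quantization and the function name are assumptions, not the challenge's exact scheme:

```python
import io
import zlib
import numpy as np

def estimate_artifact_bytes(weights: dict, code_bytes: int = 0) -> int:
    """Rough size check against the 16,000,000-byte cap.

    Assumes per-tensor symmetric INT8 quantization followed by zlib;
    the real submission pipeline may differ in details.
    """
    buf = io.BytesIO()
    for name in sorted(weights):
        w = np.asarray(weights[name], dtype=np.float32)
        scale = max(float(np.abs(w).max()), 1e-8) / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        buf.write(q.tobytes())
    return code_bytes + len(zlib.compress(buf.getvalue(), level=9))

# e.g. estimate_artifact_bytes({"wte": emb, "lm_head": head}, code_bytes=12_000)
```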

The Booth Reacts

Trent: (Eyes widening slightly) One-point-three-three-nine-four. That is a staggering improvement — sixty-six thousandths better than Hole 10, in double the time. The model was, as the caddie suspected, nowhere near finished. The loss curve descended steadily through twenty-five hundred steps with no sign of flattening. On 8xH100, one imagines it would continue for thirteen thousand steps more. (Pause) However. The artifact. Sixteen-point-seven megabytes. Over the limit. A magnificent drive… into the out-of-bounds.

Slice: OK so we just found GOLD but the suitcase is too big for the overhead bin! 1.3394, people! That’s the best number we’ve EVER seen and it’s not even on the real hardware! But we can’t submit it because the model is 700K too fat. (Pacing) This is like qualifying in ‘04 when I shot a 65 but got DQ’d for signing the wrong scorecard. The TALENT is there. The EXECUTION needs work. We’ve got to trim this thing or compress it better. The fp16 embedding fix is sitting right there — that’s 500KB back. DO IT.


The Card

Scorecard
Result: Encouraging Miss

Picked up strokes on the field

This hole improved 0.0685 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently. The catch is the artifact, which leaves 0 bytes of headroom because it is over the 16 MB cap.

val bpb: 1.3394 (+0.1150)
Bits per byte, the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

val loss: 2.2615
Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits-per-byte. Related to BPB by: BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.

params: 18,380,000
Total trainable parameters. The number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16MB limit constrains how many parameters you can afford — at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

artifact: 16.72 MB
Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16MB). The model is decompressed and dequantized before evaluation.

wall time: 600s
Training wall clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.

steps: 2,552
Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall clock limit.

step avg: 235ms
Average time per training step. How long each gradient update takes in milliseconds. Faster steps mean more training in the same wall clock budget. Affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
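The parameter-budget rule of thumb in the glossary above (roughly 16 million params at INT8, roughly 80 million at ternary) is plain bit arithmetic. A sketch that deliberately ignores code bytes and zlib overhead or gains:

```python
CAP_BYTES = 16_000_000  # decimal 16 MB: code + compressed weights

def param_budget(bits_per_param: float) -> int:
    # Crude ceiling: ignores code size and compression overhead/gains.
    return int(CAP_BYTES * 8 / bits_per_param)

print(param_budget(8.0))   # INT8    -> 16,000,000 params
print(param_budget(1.58))  # ternary -> ~81,000,000 params
```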

Training Curve

[Training curve: train_loss vs. step, this hole vs. Baseline (R2). Final train_loss: 2.3439 after ~2,552 steps.]
Post-Round Lesson

The architecture has massive headroom — 0.066 BPB improvement with double the steps. BUT the artifact hit 16.7MB, over the 16MB limit. Need to either slim the value embeddings or improve compression.

vs. the Field

             val_bpb    this hole vs.
SOTA         1.2197     +0.1197
Baseline     1.2244     +0.1150
Our Best     1.2244     +0.1150
This Hole    1.3394

Lower is better.


Model Card

How this hole was run

Run ID round_012_valemb_10min
Status ok
Training Script train_gpt_valemb.py
Backend cuda
Key Overrides
TRAIN_BATCH_TOKENS=131072
MAX_WALLCLOCK_SECONDS=600
← Back Up The Fairway: Round 1, Hole 11 · Stacking the Gains
Head To The Next Tee: Round 1, Hole 13 · Trimming the Bag →