1.3394
compression score
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
The Hole 10 architecture was still improving when the clock ran out. Doubling the wall time should show how much runway it has.
Looper’s Pick
The curve was still falling when Hole 10 hit the clock. Before we bolt on more architecture, let’s just finish the swing — same clubs, double the time. Ten minutes instead of five. If the loss keeps dropping, the bottleneck is step budget. If it plateaus, we need a different club.
The Shot — More Steps, Same Architecture
Why is "just train longer" a useful experiment?
In golf, sometimes the right call isn’t a new club — it’s a longer backswing. You had the right form all along; you just weren’t following through.
When we ran Hole 10 for 5 minutes (~1,274 steps), the training loss was still visibly decreasing at the end. This means the model hadn’t converged — it was still learning useful patterns from the data. Cutting it off at 5 minutes was an artificial constraint we imposed for fast iteration, not because the model was done.
By doubling to 10 minutes (~2,550 steps), we answer a critical question: is this architecture step-limited or capacity-limited? If the loss keeps dropping linearly, we’re step-limited — the model has more to learn and just needs more time. If it flattens out, we’re capacity-limited — the model’s 18M parameters have absorbed all they can and we need a bigger or different architecture.
The answer matters for strategy. If step-limited: focus on throughput (faster steps, better batch size, more efficient compute). If capacity-limited: focus on architecture (more params via compression, depth recurrence, etc.).
One caveat: on 8xH100, 10 minutes gives ~13,800 steps at 43ms each. Our L40S at 235ms/step only gets ~2,550 — still far short. So even if this run plateaus, the H100 might not. We’re testing the shape of the curve, not the final score.
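The step-limited vs capacity-limited question above boils down to a slope check on the tail of the validation curve. A minimal sketch of that check; the function name, the `flat_tol` threshold, and the loss values are illustrative assumptions, not values from the actual training script:

```python
# Sketch: classify a run as step-limited vs capacity-limited from the
# tail of its validation-loss curve. Threshold is an illustrative guess.

def runway_verdict(val_losses, tail=5, flat_tol=1e-3):
    """Average per-eval improvement over the last `tail` points.

    If the loss is still dropping faster than `flat_tol` per eval,
    call the run step-limited; otherwise capacity-limited.
    """
    tail_pts = val_losses[-tail:]
    drop_per_eval = (tail_pts[0] - tail_pts[-1]) / (len(tail_pts) - 1)
    return "step-limited" if drop_per_eval > flat_tol else "capacity-limited"

# A curve still visibly falling at cutoff, like Hole 10's:
print(runway_verdict([2.45, 2.38, 2.33, 2.30, 2.27, 2.26]))  # step-limited
```

The same check run on a flat tail returns "capacity-limited", which is the signal to reach for a different club rather than more clock.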
On the Tee
(Whispering) No new clubs today. The same value embeddings, the same architecture — but double the time on the clock. Ten minutes. Twenty-five hundred steps. The competitor is, in essence, wagering that the model they built in Hole 10 had more to say. The gallery settles in for a longer watch.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.3394 |
| val_loss | 2.2615 |
| params | ~18,380,000 |
| artifact | 16.72 MB (OVER 16MB limit) |
| wall time | 600s |
| steps completed | ~2,552 |
| step avg | 235ms |
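On the card, val_loss is per-token cross-entropy in nats while val_bpb is bits per byte; the bridge between them is the tokenizer's average bytes per token. A sketch of the conversion, where the bytes-per-token figure is simply backed out of the two reported numbers (an implied value, not one taken from the training script):

```python
import math

# Sketch: convert per-token cross-entropy (nats) to bits per byte.
def nats_per_token_to_bpb(val_loss, bytes_per_token):
    bits_per_token = val_loss / math.log(2)   # nats -> bits
    return bits_per_token / bytes_per_token   # per token -> per byte

# Implied by this hole's card (val_loss 2.2615, val_bpb 1.3394): ~2.44
bytes_per_token = 2.2615 / (math.log(2) * 1.3394)
print(round(nats_per_token_to_bpb(2.2615, bytes_per_token), 4))  # 1.3394
```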
The Runway Test
| Steps | val_bpb | Delta from 5-min run |
|---|---|---|
| ~1,274 (5 min, Hole 10) | 1.4057 | — |
| ~2,552 (10 min, this hole) | 1.3394 | -0.0663 |
Doubling the step count improved BPB by 0.0663: the model was deeply step-limited, and the curve was nowhere near a plateau.
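One way to gauge the remaining runway is a naive power-law fit through the two measured points. With only two points the fit is pinned exactly, so treat the projection at the H100 step budget as a back-of-envelope, not a prediction:

```python
import math

# Sketch: power-law fit bpb = coef * steps**expo through two points,
# then extrapolate to the ~13,800-step budget an 8xH100 would allow.
def fit_power_law(s1, b1, s2, b2):
    expo = math.log(b2 / b1) / math.log(s2 / s1)
    coef = b1 / s1**expo
    return lambda steps: coef * steps**expo

project = fit_power_law(1274, 1.4057, 2552, 1.3394)
print(round(project(13800), 4))  # optimistic H100-budget projection
```

The projection lands well below this hole's 1.3394, which is consistent with the caveat above: even if an L40S run eventually plateaus, the H100 budget might not.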
The Problem
The artifact is 16.72 MB, roughly 720KB over the 16MB competition limit. The value embeddings add ~1.3M params that compress less well than the main weight matrices. We need at least one of the following:
- Slim the value embeddings (fewer tables or smaller dimension)
- Apply the fp16 embedding export fix (saves ~500KB)
- Use QAT for better compression
- Reduce another part of the model to make room
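A quick budget check on those options, using the 16.72 MB artifact and the ~500KB estimate for the fp16 export fix (the extra slimming figure is a placeholder assumption). Notably, by these numbers the fp16 export alone still leaves the artifact over the limit, so the options likely need combining:

```python
# Sketch: which trimming options clear the ~720KB overage?
# Savings figures: fp16 export from the estimate above; the
# value-embedding slim is a placeholder assumption.
LIMIT_MB = 16.0
ARTIFACT_MB = 16.72

options = {
    "fp16 embedding export alone": 0.50,
    "fp16 export + slim value-embedding tables": 0.50 + 0.40,
}

for name, saved_mb in options.items():
    size = ARTIFACT_MB - saved_mb
    verdict = "fits" if size <= LIMIT_MB else "still over"
    print(f"{name}: {size:.2f} MB -> {verdict}")
```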
The Booth Reacts
Trent: (Eyes widening slightly) One-point-three-three-nine-four. That is a staggering improvement — sixty-six thousandths better than Hole 10, in double the time. The model was, as the caddy suspected, nowhere near finished. The loss curve descended steadily through twenty-five hundred steps with no sign of flattening. On 8xH100, one imagines it would continue for thirteen thousand steps more. (Pause) However. The artifact. Sixteen-point-seven megabytes. Over the limit. A magnificent drive… into the out-of-bounds.
Slice: OK so we just found GOLD but the suitcase is too big for the overhead bin! 1.3394, people! That’s the best number we’ve EVER seen and it’s not even on the real hardware! But we can’t submit it because the model is 700K too fat. (Pacing) This is like qualifying in ‘04 when I shot a 65 but got DQ’d for signing the wrong scorecard. The TALENT is there. The EXECUTION needs work. We’ve got to trim this thing or compress it better. The fp16 embedding fix is sitting right there — that’s 500KB back. DO IT.
The Card
Picked up strokes on the field
This hole improved 0.0685 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently. The asterisk: the artifact exceeds the 16MB limit, so there is no headroom left at all, and the score cannot be submitted as-is.
Training Curve
The architecture has massive headroom — 0.066 BPB improvement with double the steps. BUT the artifact hit 16.7MB, over the 16MB limit. Need to either slim the value embeddings or improve compression.
vs. the Field
Field: 1.2197, 1.2244, 1.2244. This hole: 1.3394.
Model Card
How this hole was run
| run | status | script | device |
|---|---|---|---|
| round_012_valemb_10min | ok | train_gpt_valemb.py | cuda |