Compression score: 1.4425 bits per byte
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
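For the curious, here is a sketch of how a bits-per-byte number falls out of the usual per-token cross-entropy loss. The bytes-per-token figure is back-solved from this hole's own numbers and depends entirely on the tokenizer, so treat the snippet as illustrative rather than the eval harness's actual accounting.

```python
import math

# Cross-entropy is logged in nats per token; bits per byte spreads those
# bits over the raw bytes each token covers on average.
def bits_per_byte(loss_nats_per_token: float, bytes_per_token: float) -> float:
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token

# Back-solved from this hole's numbers (2.4357 nats, 1.4425 BPB), the
# tokenizer appears to average roughly 2.44 bytes per token.
print(bits_per_byte(2.4357, 2.44))  # ~1.44
```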
The hypothesis going in: halving training context should make attention cheap enough to buy meaningful extra steps without wrecking eval.
Looper’s Pick
The playbook says “throughput first.” Attention cost is quadratic in sequence length — cut it in half and each step gets cheaper. TRAIN_SEQ_LEN=512 instead of 1024. The model sees shorter documents during training but the same amount of text per step. If the speed gain is big enough, the extra steps could compensate for the lost long-range context. The evaluation still runs at full length.
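Concretely, the pick amounts to a one-line config change. A minimal sketch follows, assuming a setup where tokens per step stays fixed while the per-step batch of sequences grows to compensate; TRAIN_SEQ_LEN is the knob named above, and every other name is a placeholder rather than the training script's real config key.

```python
# Hypothetical config sketch; only TRAIN_SEQ_LEN is named in the write-up.
TOKENS_PER_STEP = 131_072          # ~131K tokens per step, held constant

TRAIN_SEQ_LEN = 512                # this hole (Hole 5 used 1024)
SEQUENCES_PER_STEP = TOKENS_PER_STEP // TRAIN_SEQ_LEN   # 256 shorter docs
# At 1024 it would be 128 longer docs; either way the model reads the same
# amount of text per step, only the attention window it trains in changes.

EVAL_SEQ_LEN = 1024                # evaluation still runs at full length
```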
The Shot — Shorter Training Context
Why would training on shorter text sequences help?
In golf, there’s a school of thought that says “train short, play long.” Practice your 50-yard chips and your putting, and the full-length game will follow. The logic: short-range precision translates to long-range performance, and you get more reps per hour on the practice green than on the driving range.
The same idea applies to transformer training. The self-attention mechanism compares every token against every previous token, which means its compute cost grows quadratically with sequence length. A 1024-token sequence requires 4× the attention computation of a 512-token sequence. If we train at 512 tokens, each step is cheaper, and we get more steps in our wall clock budget.
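As a sanity check on that 4× figure, here is a back-of-the-envelope count of just the quadratic piece of self-attention (the QK^T scores plus the scores-times-V matmul, forward pass only). The hidden size is a placeholder, since the write-up never states the model's width.

```python
def attention_score_flops(seq_len: int, d_model: int) -> int:
    # QK^T:        (T x d) @ (d x T) -> 2 * T * T * d FLOPs
    # scores @ V:  (T x T) @ (T x d) -> 2 * T * T * d FLOPs
    return 4 * seq_len * seq_len * d_model

d = 512  # placeholder width, not the actual model's
print(attention_score_flops(1024, d) / attention_score_flops(512, d))  # 4.0
```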
The trade-off: shorter sequences mean the model never sees dependencies longer than 512 tokens during training. It can’t learn that a pronoun on token 800 refers to a noun on token 200. When we evaluate on full-length documents, the model has to extrapolate beyond what it was trained on. RoPE (rotary position embeddings) is designed to handle some extrapolation, but it’s not perfect.
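For context, a minimal sketch of rotary position embeddings in the common rotate_half formulation (base 10000). This is illustrative, not necessarily the repo's exact implementation; the point is that positions beyond the training window are still mathematically well defined, they just produce relative rotations the model never saw during training.

```python
import numpy as np

def rope_angles(position: int, head_dim: int, base: float = 10000.0) -> np.ndarray:
    # One angle per dimension pair: theta_i = position / base ** (2i / head_dim)
    i = np.arange(head_dim // 2)
    return position / base ** (2 * i / head_dim)

def apply_rope(x: np.ndarray, position: int) -> np.ndarray:
    # Pair dimension i with dimension i + d/2 and rotate each pair by its angle.
    half = x.shape[-1] // 2
    theta = rope_angles(position, x.shape[-1])
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

q = np.random.randn(64)           # a single hypothetical query head
q_seen   = apply_rope(q, 500)     # a position the 512-token runs did see
q_unseen = apply_rope(q, 900)     # a position only reachable at eval time
# Nothing breaks numerically, but token pairs more than ~512 positions apart
# produce relative rotations that never appeared in a 512-token training window.
```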
For this experiment, the results were clear: step time barely improved (224ms vs 229ms — only 2% faster) because at our small batch size of 131K tokens, attention isn’t the bottleneck. The matrix multiplications in the MLP and attention projections dominate, and those don’t depend on sequence length. Meanwhile, the quality loss from shorter context was real: 1.4425 vs 1.4139 BPB.
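To see why the matmuls win at this scale, here is a rough per-block FLOP breakdown. The hidden size is again a placeholder (the write-up gives only the 17.06M parameter count), so read the percentages as order-of-magnitude illustration rather than measurement.

```python
def block_flops(seq_len: int, d_model: int) -> dict:
    # Forward-pass matmul FLOPs for one transformer block (illustrative).
    projections = 8 * seq_len * d_model ** 2    # Q, K, V, and output, each d x d
    mlp         = 16 * seq_len * d_model ** 2   # two matmuls with a 4x expansion
    attn_scores = 4 * seq_len ** 2 * d_model    # QK^T plus scores @ V
    return {"per_token_matmuls": projections + mlp, "quadratic_attention": attn_scores}

d = 512  # placeholder width
for T in (512, 1024):
    f = block_flops(T, d)
    share = f["quadratic_attention"] / sum(f.values())
    print(T, f"quadratic share: {share:.0%}")   # ~14% at 512, ~25% at 1024
# At a fixed 131K tokens per step, the per-token matmul work is identical for
# both sequence lengths; halving T only trims the smaller quadratic slice, so
# even the best case is a modest saving, and the measured 2% is less than that.
```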
The lesson: “train short, play long” only works if “train short” actually saves meaningful time. At our batch size, it doesn’t.
On the Tee
(Whispering) The caddy has made an unusual recommendation today. Half the context length. Five hundred and twelve tokens where there were a thousand and twenty-four. The competitor can only see… half the fairway, as it were. One hopes the putting compensates.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4425 |
| val_loss | 2.4357 |
| params | 17,059,912 |
| artifact | 13.53 MB (under the 16 MB limit) |
| wall time | 300s |
| steps completed | ~1,340 |
| step avg | 224ms |
vs Hole 5 (SEQ_LEN=1024)
| Metric | Hole 5 (seq=1024) | Hole 9 (seq=512) |
|---|---|---|
| val_bpb | 1.4139 | 1.4425 |
| step avg | 229ms | 224ms (-2%) |
| steps | 1,309 | ~1,340 (+2%) |
Barely faster, noticeably worse. Context matters more than throughput at this batch size.
The Booth Reacts
Trent: One-point-four-four-two-five. (Slight wince) That is a regression of nearly three hundredths from our best. The shorter training context has, I’m afraid, done rather more harm than good. The step time improved by a mere two percent — hardly the windfall the caddy had anticipated. The model, it would seem, genuinely needs to see a thousand tokens of context to do its best work. A case of the short game letting down the long game, rather than the reverse.
Slice: Two percent faster?! TWO?! We chopped the context in HALF for two lousy percent?! Boss, that’s like… that’s like taking a shortcut through the woods and saving ten seconds while losing three balls. The attention wasn’t even the bottleneck! We already fixed the throughput problem in Hole 5 with the small batch. At 131K tokens per step, the sequence length doesn’t matter for speed — it’s all matrix multiplies either way. This was a dead end and I could’ve told you that if anyone had ASKED me. (Crosses arms)
Trent: (Adjusting glasses) To be fair, one needed to verify the hypothesis empirically. And now we know.
The Card
Dropped a shot versus the last hole
This hole gave back 0.0155 on the compression score versus the previous stop. Lower is better here, since a lower score means the model predicts unseen text more efficiently, and the artifact still leaves 2,473,297 bytes of headroom under the size cap.
Training Curve
At this batch size, attention cost, and therefore context length, was never the real bottleneck. We paid the quality cost without getting enough speed back.
vs. the Field
[Field scores: 1.2197, 1.2244, 1.2244; this hole: 1.4425.]
Model Card
How this hole was run
| run | status | device |
|---|---|---|
| round_009_seq512 | ok | cuda |