Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

Round 1, Hole 9 · Triple Bogey · March 19, 2026

1.4425

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.2181 · vs last hole: +0.0155
Tee Box: R1 · H9
Artifact: 13.53 MB
Headroom: 2.47 MB (room left under the 16 MB limit)
Tempo: 224 ms/step · 1,340 steps
Looper · The Caddie
Calculated Risk

The playbook says "throughput first." Attention cost is quadratic in sequence length — cut it in half and each step gets cheaper. TRAIN_SEQ_LEN=512 instead of 1024. The model sees shorter documents during training but the same amount of text per step. If the speed gain is big enough, the extra steps could compensate for the lost long-range context. The evaluation still runs at full length.

Technical Read

Halving training context should make attention cheap enough to buy meaningful extra steps without wrecking eval.

Trent Fairway · On the Tee

(Whispering) The caddie has made an unusual recommendation today. Half the context length. Five hundred and twelve tokens where there were a thousand and twenty-four. The competitor can only see... half the fairway, as it were. One hopes the putting compensates.


The Shot — Shorter Training Context

Why would training on shorter text sequences help?

In golf, there’s a school of thought that says “train short, play long.” Practice your 50-yard chips and your putting, and the full-length game will follow. The logic: short-range precision translates to long-range performance, and you get more reps per hour on the practice green than on the driving range.

The same idea applies to transformer training. The self-attention mechanism compares every token against every previous token, which means its compute cost grows quadratically with sequence length. A 1024-token sequence requires 4× the attention computation of a 512-token sequence. If we train at 512 tokens, each step is cheaper, and we get more steps in our wall clock budget.
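
Back-of-the-envelope numbers for that claim (a sketch; the model width here is illustrative, not taken from the run). Note the per-step saving is 2×, not 4×: at a fixed token budget, halving the sequence length also doubles the number of sequences per batch.

```python
# Rough attention FLOPs per training batch: QK^T plus attn@V cost about
# 4 * L^2 * d_model multiply-accumulates per sequence, and a fixed budget
# of batch_tokens gives batch_tokens / L sequences per step.
def attn_flops_per_step(seq_len: int, batch_tokens: int = 131_072,
                        d_model: int = 384) -> float:   # d_model is a guess
    sequences = batch_tokens / seq_len
    return sequences * 4 * seq_len**2 * d_model

ratio = attn_flops_per_step(1024) / attn_flops_per_step(512)
print(ratio)   # 2.0: per-sequence cost drops 4x, but twice as many sequences
```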

The trade-off: shorter sequences mean the model never sees dependencies longer than 512 tokens during training. It can't learn that a pronoun at position 800 refers to a noun at position 200. When we evaluate on full-length documents, the model has to extrapolate beyond what it was trained on. RoPE (rotary position embeddings) is designed to handle some extrapolation, but it's not perfect.
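
To make the extrapolation point concrete, here is a minimal RoPE sketch (the interleaved-pair variant; the head width is illustrative). The rotation angle is a pure function of position, so positions past 512 are perfectly well-defined at eval time; the model simply never trained on them.

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor,
               base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of a query/key tensor by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)
    angles = positions.float()[:, None] * inv_freq[None, :]   # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1024, 64)                   # a full-length eval sequence
q_rot = apply_rope(q, torch.arange(1024))   # positions 512..1023 still get
                                            # valid rotations, just unseen ones
```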

For this experiment, the results were clear: step time barely improved (224ms vs 229ms — only 2% faster) because at our small batch size of 131K tokens, attention isn’t the bottleneck. The matrix multiplications in the MLP and attention projections dominate, and those don’t depend on sequence length. Meanwhile, the quality loss from shorter context was real: 1.4425 vs 1.4139 BPB.
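
Slice's "it's all matrix multiplies" line (see the booth reaction below) checks out on paper. Using the standard accounting, roughly 6 FLOPs per parameter per token for a training step versus attention's roughly 12·L·d_model per layer per token, with depth and width guessed for a ~17M-parameter model:

```python
N_PARAMS = 17_059_912            # from the scorecard
N_LAYERS, D_MODEL = 6, 384       # hypothetical shape for a model this size

def attention_share(seq_len: int) -> float:
    matmul = 6 * N_PARAMS                     # weight matmuls, L-independent
                                              # (crude: counts embeddings too)
    attn = 12 * seq_len * D_MODEL * N_LAYERS  # QK^T + attn@V, fwd + bwd
    return attn / (matmul + attn)

for L in (1024, 512):
    print(L, f"~{attention_share(L):.0%}")
# 1024: ~22%, 512: ~12% of per-token FLOPs. Halving L trims total work by
# roughly a tenth, and the measured step time moved even less than that.
```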

The lesson: “train short, play long” only works if “train short” actually saves meaningful time. At our batch size, it doesn’t.


Results

Metric            Value
val_bpb           1.4425
val_loss          2.4357
params            17,059,912
artifact          13.53 MB (yes, < 16 MB)
wall time         300s
steps completed   ~1,340
step avg          224ms

vs Hole 5 (SEQ_LEN=1024)

Metric     Hole 5 (seq=1024)   Hole 9 (seq=512)
val_bpb    1.4139              1.4425
step avg   229ms               224ms (-2%)
steps      1,309               ~1,340 (+2%)
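
The step counts fall straight out of the 300-second wall clock and the per-step averages:

```python
WALL_CLOCK_S = 300
for hole, step_ms in (("Hole 5", 229), ("Hole 9", 224)):
    steps = WALL_CLOCK_S * 1000 / step_ms
    tokens = steps * 131_072                  # TRAIN_BATCH_TOKENS per step
    print(f"{hole}: {steps:.0f} steps, {tokens / 1e6:.0f}M tokens")
# Hole 5: 1310 steps, 172M tokens; Hole 9: 1339 steps, 176M tokens
```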

Barely faster, noticeably worse. Context matters more than throughput at this batch size.

The Booth Reacts

Trent: One-point-four-four-two-five. (Slight wince) That is a regression of nearly three hundredths from our best. The shorter training context has, I’m afraid, done rather more harm than good. The step time improved by a mere two percent — hardly the windfall the caddie had anticipated. The model, it would seem, genuinely needs to see a thousand tokens of context to do its best work. A case of the short game letting down the long game, rather than the reverse.

Slice: Two percent faster?! TWO?! We chopped the context in HALF for two lousy percent?! Boss, that’s like… that’s like taking a shortcut through the woods and saving ten seconds while losing three balls. The attention wasn’t even the bottleneck! We already fixed the throughput problem in Hole 5 with the small batch. At 131K tokens per step, the sequence length doesn’t matter for speed — it’s all matrix multiplies either way. This was a dead end and I could’ve told you that if anyone had ASKED me. (Crosses arms)

Trent: (Adjusting glasses) To be fair, one needed to verify the hypothesis empirically. And now we know.


The Card

Scorecard
Result: Dead End

Dropped a shot versus the last hole

This hole lost 0.0155 on the compression score versus the previous stop. Lower is better here: a lower score means the model predicts unseen text more efficiently. The run still leaves 2,473,297 bytes of artifact headroom.

val bpb: 1.4425 (+0.2181)
Bits per byte — the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

val loss: 2.4357
Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits per byte. Related to BPB by: BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.

params: 17,059,912
Total trainable parameters. The number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16 MB limit constrains how many parameters you can afford — at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

artifact: 13.53 MB
Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16 MB). The model is decompressed and dequantized before evaluation.

wall time: 300s
Training wall clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.

steps: 1,340
Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall clock limit.

step avg: 224ms
Average time per training step. How long each gradient update takes in milliseconds. Faster steps mean more training in the same wall clock budget. Affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
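
The BPB formula in the val loss note can be sanity-checked against this hole's own numbers by solving for the tokens-to-bytes ratio:

```python
import math

VAL_LOSS = 2.4357    # nats per token, from the scorecard
VAL_BPB = 1.4425     # bits per byte, from the scorecard

bits_per_token = VAL_LOSS / math.log(2)       # ~3.514
tokens_per_byte = VAL_BPB / bits_per_token    # BPB = bits/token * tokens/bytes
print(f"{bits_per_token:.3f} bits/token, {1 / tokens_per_byte:.2f} bytes/token")
# ~3.514 bits/token at ~2.44 bytes of raw text per token
```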

Training Curve

[Training curve: train_loss vs. step, this hole against Baseline (R2); final train_loss 2.4271]
Post-Round Lesson

At this batch size, context length was not the real bottleneck. We paid the quality cost without getting enough speed back.

vs. the Field

+0.2228 vs SOTA (1.2197)
+0.2181 vs Baseline (1.2244)
+0.2181 vs Our Best (1.2244)


Model Card

How this hole was run

Run ID round_009_seq512
Status ok
Backend cuda
Key Overrides
TRAIN_SEQ_LEN=512
TRAIN_BATCH_TOKENS=131072
MAX_WALLCLOCK_SECONDS=300
← Back Up The Fairway: Round 1, Hole 8 · The Skinny Iron
Head To The Next Tee: Round 1, Hole 10 · The Extra Set of Eyes →