Compression score: 1.4425 bits per byte
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
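For the curious, here is a sketch of how a bits-per-byte number falls out of the usual per-token cross-entropy loss. The bytes-per-token figure is back-solved from this hole's own numbers and depends entirely on the tokenizer, so treat the snippet as illustrative rather than the eval harness's actual accounting.

```python
import math

# Cross-entropy is logged in nats per token; bits per byte spreads those
# bits over the raw bytes each token covers on average.
def bits_per_byte(loss_nats_per_token: float, bytes_per_token: float) -> float:
    bits_per_token = loss_nats_per_token / math.log(2)
    return bits_per_token / bytes_per_token

# Back-solved from this hole's numbers (2.4357 nats, 1.4425 BPB), the
# tokenizer appears to average roughly 2.44 bytes per token.
print(bits_per_byte(2.4357, 2.44))  # ~1.44
```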
The hypothesis going in: halving training context should make attention cheap enough to buy meaningful extra steps without wrecking eval.
Looper’s Pick
The playbook says “throughput first.” Attention cost is quadratic in sequence length — cut it in half and each step gets cheaper. TRAIN_SEQ_LEN=512 instead of 1024. The model sees shorter documents during training but the same amount of text per step. If the speed gain is big enough, the extra steps could compensate for the lost long-range context. The evaluation still runs at full length.
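Concretely, the pick amounts to a one-line config change. A minimal sketch follows, assuming a setup where tokens per step stays fixed while the per-step batch of sequences grows to compensate; TRAIN_SEQ_LEN is the knob named above, and every other name is a placeholder rather than the training script's real config key.

```python
# Hypothetical config sketch; only TRAIN_SEQ_LEN is named in the write-up.
TOKENS_PER_STEP = 131_072          # ~131K tokens per step, held constant

TRAIN_SEQ_LEN = 512                # this hole (Hole 5 used 1024)
SEQUENCES_PER_STEP = TOKENS_PER_STEP // TRAIN_SEQ_LEN   # 256 shorter docs
# At 1024 it would be 128 longer docs; either way the model reads the same
# amount of text per step, only the attention window it trains in changes.

EVAL_SEQ_LEN = 1024                # evaluation still runs at full length
```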
The Shot — Shorter Training Context
Why would training on shorter text sequences help?
In golf, there’s a school of thought that says “train short, play long.” Practice your 50-yard chips and your putting, and the full-length game will follow. The logic: short-range precision translates to long-range performance, and you get more reps per hour on the practice green than on the driving range.
The same idea applies to transformer training. The self-attention mechanism compares every token against every previous token, which means its compute cost grows quadratically with sequence length. A 1024-token sequence requires 4× the attention computation of a 512-token sequence. If we train at 512 tokens, each step is cheaper, and we get more steps in our wall clock budget.
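As a sanity check on that 4× figure, here is a back-of-the-envelope count of just the quadratic piece of self-attention (the QK^T scores plus the scores-times-V matmul, forward pass only). The hidden size is a placeholder, since the write-up never states the model's width.

```python
def attention_score_flops(seq_len: int, d_model: int) -> int:
    # QK^T:        (T x d) @ (d x T) -> 2 * T * T * d FLOPs
    # scores @ V:  (T x T) @ (T x d) -> 2 * T * T * d FLOPs
    return 4 * seq_len * seq_len * d_model

d = 512  # placeholder width, not the actual model's
print(attention_score_flops(1024, d) / attention_score_flops(512, d))  # 4.0
```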
The trade-off: shorter sequences mean the model never sees dependencies longer than 512 tokens during training. It can’t learn that a pronoun on token 800 refers to a noun on token 200. When we evaluate on full-length documents, the model has to extrapolate beyond what it was trained on. RoPE (rotary position embeddings) is designed to handle some extrapolation, but it’s not perfect.
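For context, a minimal sketch of rotary position embeddings in the common rotate_half formulation (base 10000). This is illustrative, not necessarily the repo's exact implementation; the point is that positions beyond the training window are still mathematically well defined, they just produce relative rotations the model never saw during training.

```python
import numpy as np

def rope_angles(position: int, head_dim: int, base: float = 10000.0) -> np.ndarray:
    # One angle per dimension pair: theta_i = position / base ** (2i / head_dim)
    i = np.arange(head_dim // 2)
    return position / base ** (2 * i / head_dim)

def apply_rope(x: np.ndarray, position: int) -> np.ndarray:
    # Pair dimension i with dimension i + d/2 and rotate each pair by its angle.
    half = x.shape[-1] // 2
    theta = rope_angles(position, x.shape[-1])
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(theta) - x2 * np.sin(theta),
                           x1 * np.sin(theta) + x2 * np.cos(theta)])

q = np.random.randn(64)           # a single hypothetical query head
q_seen   = apply_rope(q, 500)     # a position the 512-token runs did see
q_unseen = apply_rope(q, 900)     # a position only reachable at eval time
# Nothing breaks numerically, but token pairs more than ~512 positions apart
# produce relative rotations that never appeared in a 512-token training window.
```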
For this experiment, the results were clear: step time barely improved (224ms vs 229ms — only 2% faster) because at our small batch size of 131K tokens, attention isn’t the bottleneck. The matrix multiplications in the MLP and attention projections dominate, and those don’t depend on sequence length. Meanwhile, the quality loss from shorter context was real: 1.4425 vs 1.4139 BPB.
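To see why the matmuls win at this scale, here is a rough per-block FLOP breakdown. The hidden size is again a placeholder (the write-up gives only the 17.06M parameter count), so read the percentages as order-of-magnitude illustration rather than measurement.

```python
def block_flops(seq_len: int, d_model: int) -> dict:
    # Forward-pass matmul FLOPs for one transformer block (illustrative).
    projections = 8 * seq_len * d_model ** 2    # Q, K, V, and output, each d x d
    mlp         = 16 * seq_len * d_model ** 2   # two matmuls with a 4x expansion
    attn_scores = 4 * seq_len ** 2 * d_model    # QK^T plus scores @ V
    return {"per_token_matmuls": projections + mlp, "quadratic_attention": attn_scores}

d = 512  # placeholder width
for T in (512, 1024):
    f = block_flops(T, d)
    share = f["quadratic_attention"] / sum(f.values())
    print(T, f"quadratic share: {share:.0%}")   # ~14% at 512, ~25% at 1024
# At a fixed 131K tokens per step, the per-token matmul work is identical for
# both sequence lengths; halving T only trims the smaller quadratic slice, so
# even the best case is a modest saving, and the measured 2% is less than that.
```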
The lesson: “train short, play long” only works if “train short” actually saves meaningful time. At our batch size, it doesn’t.
On the Tee
(Whispering) The caddy has made an unusual recommendation today. Half the context length. Five hundred and twelve tokens where there were a thousand and twenty-four. The competitor can only see… half the fairway, as it were. One hopes the putting compensates.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4425 |
| val_loss | 2.4357 |
| params | 17,059,912 |
| artifact | 13.53 MB (under the 16 MB limit) |
| wall time | 300s |
| steps completed | ~1,340 |
| step avg | 224ms |
vs Hole 5 (SEQ_LEN=1024)
| Metric | Hole 5 (seq=1024) | Hole 9 (seq=512) |
|---|---|---|
| val_bpb | 1.4139 | 1.4425 |
| step avg | 229ms | 224ms (-2%) |
| steps | 1,309 | ~1,340 (+2%) |
Barely faster, noticeably worse. Context matters more than throughput at this batch size.
The Booth Reacts
Trent: One-point-four-four-two-five. (Slight wince) That is a regression of nearly three hundredths from our best. The shorter training context has, I’m afraid, done rather more harm than good. The step time improved by a mere two percent — hardly the windfall the caddy had anticipated. The model, it would seem, genuinely needs to see a thousand tokens of context to do its best work. A case of the short game letting down the long game, rather than the reverse.
Slice: Two percent faster?! TWO?! We chopped the context in HALF for two lousy percent?! Boss, that’s like… that’s like taking a shortcut through the woods and saving ten seconds while losing three balls. The attention wasn’t even the bottleneck! We already fixed the throughput problem in Hole 5 with the small batch. At 131K tokens per step, the sequence length doesn’t matter for speed — it’s all matrix multiplies either way. This was a dead end and I could’ve told you that if anyone had ASKED me. (Crosses arms)
Trent: (Adjusting glasses) To be fair, one needed to verify the hypothesis empirically. And now we know.
The Card
Dropped a shot versus the last hole
This hole gave back 0.0155 on the compression score versus the previous stop. Lower is better here, since a lower score means the model predicts unseen text more efficiently, and the artifact still leaves 2,473,297 bytes of headroom under the size cap.
Training Curve
At this batch size, attention cost, and therefore context length, was never the real bottleneck. We paid the quality cost without getting enough speed back.
vs. the Field
[Field scores: 1.2197, 1.2244, 1.2244; this hole: 1.4425.]
Model Card
How this hole was run
| run | status | device |
|---|---|---|
| round_009_seq512 | ok | cuda |