Compression score: 1.4079
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
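For the curious, converting a token-level loss into bits per byte is plain arithmetic. A minimal sketch, assuming the score is mean cross-entropy in nats converted to bits and normalized by raw text bytes; the ~2.44 bytes-per-token ratio here is back-solved from this hole's val_loss (2.3772) and val_bpb (1.4079), not something the card states:

```python
import math

def bits_per_byte(nats_per_token, bytes_per_token):
    # bits/token = nats/token / ln(2); divide by bytes/token -> bits/byte
    return nats_per_token / (math.log(2) * bytes_per_token)

# Illustrative: a tokenizer averaging ~2.44 bytes/token turns this hole's
# val_loss into roughly the reported bpb.
print(round(bits_per_byte(2.3772, 2.436), 4))  # → 1.4079
```

The same loss scores better in bpb the more raw bytes each token covers, which is why the challenge normalizes by bytes rather than tokens.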
Combining value embeddings (Hole 10 win) with KV=2 (Hole 7 free lunch) should stack both gains — better quality plus faster steps.
Looper’s Pick
We’ve got two clubs the caddie likes: value embeddings from Hole 10 and KV=2 from Hole 7. One buys quality, the other buys tempo. In a perfect world, we stack them and stroll into the clubhouse with both gains at once. In the real world, architecture tweaks love to get precious about each other. Let’s find out which version of golf this is.
The Shot — Combining Architectural Wins
Why wouldn't two improvements simply add up?
In golf, a new driver and a new putting grip might each save you a stroke independently. But they don’t necessarily save you two strokes together — the driver change might alter your approach angles, making the putting improvement less relevant on the shots you actually face.
Architectural changes in neural networks interact the same way. Value embeddings add extra information to the attention value signal. Reducing KV heads from 4 to 2 halves the dimensionality of that value signal — from 256 to 128 dimensions. So the value embeddings that worked at 256 dimensions now have to squeeze into 128 dimensions, which means less room for the extra information they carry.
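A toy sketch of the squeeze, in NumPy rather than the actual `train_gpt_valemb.py` internals; names and dimensions are illustrative, but the shape arithmetic is the point: the value-embedding table must match the width of the V signal, so halving KV heads halves the space it can write into.

```python
import numpy as np

head_dim, seq_len, vocab = 64, 8, 100
rng = np.random.default_rng(0)

for n_kv_heads in (4, 2):
    v_dim = n_kv_heads * head_dim                 # 256 at KV=4, 128 at KV=2
    value_emb = rng.normal(size=(vocab, v_dim))   # learned table (random here)
    tokens = rng.integers(0, vocab, size=seq_len)
    v = rng.normal(size=(seq_len, v_dim))         # ordinary value projection
    v = v + value_emb[tokens]                     # inject extra info into V
    print(n_kv_heads, v.shape[-1])                # → 4 256, then 2 128
```

The extra information hasn't vanished at KV=2; it simply shares half as many dimensions with the ordinary value projection.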
On the speed side, the combination does stack: 221ms per step vs 236ms with just value embeddings (6% faster, ~85 more steps in 5 minutes). But the quality loss from the smaller value space partially erodes the value embedding gains.
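The tempo arithmetic checks out with integer division over the 5-minute budget (a back-of-the-envelope sketch, not the harness's exact accounting):

```python
BUDGET_MS = 300_000                      # 5-minute wall clock
steps_combined = BUDGET_MS // 221        # value embs + KV=2
steps_valemb   = BUDGET_MS // 236        # value embs alone
print(steps_combined, steps_valemb, steps_combined - steps_valemb)
# → 1357 1271 86
```

That predicted 1,357 steps sits right next to the 1,359 the run actually completed.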
This is a common pattern in model optimization: improvements are rarely perfectly additive. Each change reshapes the loss landscape, and the next change operates on a different surface than the one it was tested on. The discipline is to test combinations empirically rather than assuming they’ll stack, and to keep the individual changes available for mixing differently later.
On the Tee
(Whispering) A delicate bit of overconfidence, perhaps. The competitor is combining two earlier wins in one swing: value embeddings from Hole 10 and reduced KV heads from Hole 7. In golfing terms, it is rather like changing both your club and your grip after a birdie. The ingredients have pedigree. Whether the recipe does is another matter entirely.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4079 |
| val_loss | 2.3772 |
| params | ~17,200,000 |
| artifact | 14.45 MB (yes < 16MB) |
| wall time | 300s |
| steps completed | 1,359 |
| step avg | 221ms |
The Stack
| Hole | Config | BPB | Step avg |
|---|---|---|---|
| 5 | Baseline arch | 1.4139 | 229ms |
| 7 | + KV=2 | 1.4140 | 218ms |
| 10 | + Value embs (KV=4) | 1.4057 | 236ms |
| 11 | + Value embs + KV=2 | 1.4079 | 221ms |
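The table makes the interaction easy to quantify. If the two deltas stacked perfectly on the Hole 5 baseline, the combined score would land near 1.4058; a quick check of what stacking actually cost:

```python
baseline = 1.4139   # Hole 5
kv2      = 1.4140   # Hole 7: + KV=2
valemb   = 1.4057   # Hole 10: + value embs (KV=4)
combined = 1.4079   # Hole 11: both at once

d_kv2    = kv2 - baseline      # each change's delta applied alone
d_valemb = valemb - baseline

expected    = baseline + d_kv2 + d_valemb   # perfectly additive prediction
interaction = combined - expected           # what stacking actually cost
print(round(expected, 4), round(interaction, 4))  # → 1.4058 0.0021
```

An interaction penalty of 0.0021 bpb: the combination keeps only about a quarter of the value-embedding gain.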
This is the sort of result every experiment notebook needs: not a disaster, not a triumph, but a firm answer. Value embeddings at KV=4 remains the better club. KV=2 bought back some pace, but it squeezed the very mechanism that made Hole 10 special.
The Booth Reacts
Trent: One-point-four-zero-seven-nine. (Measured nod) Respectable — second-best on our card, in fact — but not the glorious stack one had hoped for. The value embeddings, it seems, prefer the wider canvas of four KV heads. Compress the value space and the flourish loses some of its force. A useful answer, though. Good clubs are not always good doubles partners.
Slice: So let me get this straight. KV=2 by itself: fine. Value embeddings by themselves: best ball on the course. Put them together and suddenly everybody forgets how to coordinate? (Throws hands up) Classic neural-network behavior. Fine. Message received. Keep the value embeddings, give them the full four KV heads, and stop trying to make every good idea marry every other good idea on the first date.
The Card
Dropped a shot versus the last hole
This hole gave back 0.0022 on the compression score versus the previous stop (lower is better). The artifact still leaves 1,550,686 bytes of headroom under the 16 MB cap.
Training Curve
Partial stack. KV=2 shrinks the value-embedding dimension too, blunting their effect; the speed gain (221ms vs 236ms per step) didn't fully compensate.
vs. the Field
[Leaderboard chart: top of field at 1.2197, 1.2244, 1.2244; this round at 1.4079]
Model Card
How this hole was run
| run | status | script | device |
|---|---|---|---|
| round_011_valemb_kv2 | ok | train_gpt_valemb.py | cuda |