What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
Adding learned value embeddings to attention should enrich the value signal at near-zero compute cost, improving quality per step.
Looper’s Pick
Every hole so far has been env var golf: same chassis, same swing, different knobs. Time to finally pop the hood. The NanoGPT speedrun crowd keeps raving about value embeddings — extra lookup tables that feed directly into the attention values. Near-zero compute cost, a modest parameter bump, and repeated wins at the 124M scale. I patched train_gpt.py with value embedding tables shared across layer pairs. If the local rig is ready for a real architecture shot, this is the club.
The Shot — Value Embeddings
What are value embeddings and why do they help?
In golf, your caddy doesn’t just hand you a club — a great caddy tells you things about the hole that change how you play the shot. “The green slopes left.” “There’s a hidden bunker behind that ridge.” That extra context changes the outcome even though the physical swing is the same.
Value embeddings work the same way. In standard self-attention, each token creates a Value vector (the “information to pass along”) by multiplying its representation through a learned weight matrix. This is the only source of value information — everything the attention mechanism passes forward comes from that single linear projection.
Value embeddings add a second source. We create a separate embedding table — a lookup that maps each token ID directly to a value-sized vector, bypassing the attention projection entirely. The final value becomes: V = projected_value + λ * value_embedding, where λ is a learned mixing weight (initialized near zero so training starts from the baseline behavior).
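The mixing rule can be sketched in a few lines of NumPy. This is an illustrative toy, not the actual train_gpt.py patch: `W_v`, `value_emb`, `lam`, and the shapes other than vocab=1024 and kv_dim=256 are hypothetical stand-ins.

```python
import numpy as np

# Toy shapes: vocab_size and kv_dim match the post; d_model and seq_len are illustrative.
rng = np.random.default_rng(0)
vocab_size, kv_dim, d_model, seq_len = 1024, 256, 384, 8

W_v = rng.normal(0.0, 0.02, (d_model, kv_dim))           # standard value projection
value_emb = rng.normal(0.0, 0.02, (vocab_size, kv_dim))  # new lookup table, one row per token ID
lam = 0.0                                                # learned mixing scalar, initialized at zero

x = rng.normal(size=(seq_len, d_model))                  # token representations at this layer
token_ids = rng.integers(0, vocab_size, size=seq_len)    # raw token IDs for the sequence

# V = projected_value + lambda * value_embedding
V = x @ W_v + lam * value_emb[token_ids]
```

With `lam` initialized to zero, `V` is exactly the baseline projection, so training starts from unchanged behavior and only uses the lookup table as much as the learned scalar allows.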
Why does this help? The projected value has to squeeze through a bottleneck: the token’s representation at this layer, multiplied by a weight matrix. The value embedding provides a direct shortcut — token-specific information that doesn’t need to be reconstructed from context. Think of it as giving each token a “cheat sheet” of its most useful properties.
We share each value embedding table across pairs of layers (layers 0&1 share one table, layers 2&3 share another, etc.) to keep the parameter cost low. With vocab=1024 and kv_dim=256, each table is 262K params. Five tables for 9 layers = ~1.3M extra parameters, about 7% of the baseline model.
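The sharing scheme and parameter accounting above work out as follows; a small pure-Python check, with the baseline parameter count taken from the results table in this post:

```python
# Layer-pair sharing: layers 0&1 use table 0, layers 2&3 use table 1, ..., layer 8 uses table 4.
n_layers, vocab_size, kv_dim = 9, 1024, 256

table_of_layer = [layer // 2 for layer in range(n_layers)]
n_tables = max(table_of_layer) + 1             # 5 tables for 9 layers
params_per_table = vocab_size * kv_dim         # 262,144 params each
extra_params = n_tables * params_per_table     # ~1.3M extra params

baseline_params = 17_059_912                   # Hole 5 count from the results table
overhead_pct = 100 * extra_params / baseline_params
```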
This technique was one of the most consistently cited wins in the NanoGPT speedrun community, validated at the 124M parameter scale — very close to our 17M regime.
On the Tee
(Whispering) A notable hush around the grounds now. For the first time on this course, the competitor has changed the architecture itself — not the tempo, not the learning rate, but the shape of the machine. Value embeddings. Five new lookup tables stitched into the attention values. This is no longer pro-shop tuning. This is custom clubmaking.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4057 |
| val_loss | 2.3734 |
| params | ~18,380,000 |
| artifact | 14.48 MB (under the 16 MB cap) |
| wall time | 300s |
| steps completed | ~1,274 |
| step avg | 236ms |
vs Hole 5 (best env-var-only config)
| Metric | Hole 5 (baseline arch) | Hole 10 (+ value embs) |
|---|---|---|
| val_bpb | 1.4139 | 1.4057 |
| delta | — | -0.0082 |
| params | 17,059,912 | ~18,380,000 (+7.7%) |
| step avg | 229ms | 236ms (+3%) |
| artifact | 13.35 MB | 14.48 MB |
More parameters, slightly slower steps, and still a real gain on the card. This is the first change in a while that asks for something and clearly gives something back.
The Booth Reacts
Trent: (Leaning into microphone) One-point-four-zero-five-seven. Ladies and gentlemen, that is a new personal best on this course. Eight thousandths clear of every previous attempt. And it arrives not from a learning-rate shuffle or a batch-size parlor trick, but from a bona fide architectural improvement. Value embeddings. The competitor has altered the shape of the club and found a cleaner strike. Yes, the artifact is heavier and the tempo a touch slower, but one suspects the membership will forgive such things when the ball lands here.
Slice: NOW we’re actually playing golf. Eight thousandths may not sound like much until you remember NOTHING else was moving this number. We tried recipe tweaks. We tried austerity. We tried wishful thinking. Then the kid bolts on value embeddings and suddenly we’ve got a real birdie chance. That’s not noise. That’s a club change. (Taps table emphatically) Put this thing back in the bag immediately and don’t let anybody “simplify” it out of the lineup.
The Card
Picked up strokes on the field
This hole improved 0.0082 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently while leaving 1,523,391 bytes of artifact headroom.
Training Curve
First real architecture win. The value embeddings earned their 1.3M parameters — 0.0082 BPB improvement over the baseline config.
vs. the Field
[Leaderboard chart: field scores 1.2197, 1.2244, 1.2244; this round 1.4057]
Model Card
How this hole was run
| run | status | script | device |
|---|---|---|---|
| round_010_valemb | ok | train_gpt_valemb.py | cuda |