What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
Adding learned value embeddings to attention should enrich the value signal at near-zero compute cost, improving quality per step.
Looper’s Pick
Every hole so far has been env var golf: same chassis, same swing, different knobs. Time to finally pop the hood. The NanoGPT speedrun crowd keeps raving about value embeddings — extra lookup tables that feed directly into the attention values. Near-zero compute cost, a modest parameter bump, and repeated wins at the 124M scale. I patched train_gpt.py with value embedding tables shared across layer pairs. If the local rig is ready for a real architecture shot, this is the club.
The Shot — Value Embeddings
What are value embeddings and why do they help?
In golf, your caddy doesn’t just hand you a club — a great caddy tells you things about the hole that change how you play the shot. “The green slopes left.” “There’s a hidden bunker behind that ridge.” That extra context changes the outcome even though the physical swing is the same.
Value embeddings work the same way. In standard self-attention, each token creates a Value vector (the “information to pass along”) by multiplying its representation through a learned weight matrix. This is the only source of value information — everything the attention mechanism passes forward comes from that single linear projection.
Value embeddings add a second source. We create a separate embedding table — a lookup that maps each token ID directly to a value-sized vector, bypassing the attention projection entirely. The final value becomes: V = projected_value + λ * value_embedding, where λ is a learned mixing weight (initialized near zero so training starts from the baseline behavior).
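The mixing rule can be sketched in a few lines of NumPy. This is an illustrative toy, not the actual train_gpt.py patch: `W_v`, `value_emb`, `lam`, and the shapes other than vocab=1024 and kv_dim=256 are hypothetical stand-ins.

```python
import numpy as np

# Toy shapes: vocab_size and kv_dim match the post; d_model and seq_len are illustrative.
rng = np.random.default_rng(0)
vocab_size, kv_dim, d_model, seq_len = 1024, 256, 384, 8

W_v = rng.normal(0.0, 0.02, (d_model, kv_dim))           # standard value projection
value_emb = rng.normal(0.0, 0.02, (vocab_size, kv_dim))  # new lookup table, one row per token ID
lam = 0.0                                                # learned mixing scalar, initialized at zero

x = rng.normal(size=(seq_len, d_model))                  # token representations at this layer
token_ids = rng.integers(0, vocab_size, size=seq_len)    # raw token IDs for the sequence

# V = projected_value + lambda * value_embedding
V = x @ W_v + lam * value_emb[token_ids]
```

With `lam` initialized to zero, `V` is exactly the baseline projection, so training starts from unchanged behavior and only uses the lookup table as much as the learned scalar allows.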
Why does this help? The projected value has to squeeze through a bottleneck: the token’s representation at this layer, multiplied by a weight matrix. The value embedding provides a direct shortcut — token-specific information that doesn’t need to be reconstructed from context. Think of it as giving each token a “cheat sheet” of its most useful properties.
We share each value embedding table across pairs of layers (layers 0&1 share one table, layers 2&3 share another, etc.) to keep the parameter cost low. With vocab=1024 and kv_dim=256, each table is 262K params. Five tables for 9 layers = ~1.3M extra parameters, about 7% of the baseline model.
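The sharing scheme and parameter accounting above work out as follows; a small pure-Python check, with the baseline parameter count taken from the results table in this post:

```python
# Layer-pair sharing: layers 0&1 use table 0, layers 2&3 use table 1, ..., layer 8 uses table 4.
n_layers, vocab_size, kv_dim = 9, 1024, 256

table_of_layer = [layer // 2 for layer in range(n_layers)]
n_tables = max(table_of_layer) + 1             # 5 tables for 9 layers
params_per_table = vocab_size * kv_dim         # 262,144 params each
extra_params = n_tables * params_per_table     # ~1.3M extra params

baseline_params = 17_059_912                   # Hole 5 count from the results table
overhead_pct = 100 * extra_params / baseline_params
```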
This technique was one of the most consistently cited wins in the NanoGPT speedrun community, validated at the 124M parameter scale — very close to our 17M regime.
On the Tee
(Whispering) A notable hush around the grounds now. For the first time on this course, the competitor has changed the architecture itself — not the tempo, not the learning rate, but the shape of the machine. Value embeddings. Five new lookup tables stitched into the attention values. This is no longer pro-shop tuning. This is custom clubmaking.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4057 |
| val_loss | 2.3734 |
| params | ~18,380,000 |
| artifact | 14.48 MB (under the 16 MB cap) |
| wall time | 300s |
| steps completed | ~1,274 |
| step avg | 236ms |
vs Hole 5 (best env-var-only config)
| Metric | Hole 5 (baseline arch) | Hole 10 (+ value embs) |
|---|---|---|
| val_bpb | 1.4139 | 1.4057 |
| delta | — | -0.0082 |
| params | 17,059,912 | ~18,380,000 (+7.7%) |
| step avg | 229ms | 236ms (+3%) |
| artifact | 13.35 MB | 14.48 MB |
More parameters, slightly slower steps, and still a real gain on the card. This is the first change in a while that asks for something and clearly gives something back.
The Booth Reacts
Trent: (Leaning into microphone) One-point-four-zero-five-seven. Ladies and gentlemen, that is a new personal best on this course. Eight thousandths clear of every previous attempt. And it arrives not from a learning-rate shuffle or a batch-size parlor trick, but from a bona fide architectural improvement. Value embeddings. The competitor has altered the shape of the club and found a cleaner strike. Yes, the artifact is heavier and the tempo a touch slower, but one suspects the membership will forgive such things when the ball lands here.
Slice: NOW we’re actually playing golf. Eight thousandths may not sound like much until you remember NOTHING else was moving this number. We tried recipe tweaks. We tried austerity. We tried wishful thinking. Then the kid bolts on value embeddings and suddenly we’ve got a real birdie chance. That’s not noise. That’s a club change. (Taps table emphatically) Put this thing back in the bag immediately and don’t let anybody “simplify” it out of the lineup.
The Card
Picked up strokes on the field
This hole improved 0.0082 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently while leaving 1,523,391 bytes of artifact headroom.
Training Curve
First real architecture win. The value embeddings earned their 1.3M parameters — 0.0082 BPB improvement over the baseline config.
vs. the Field
[Leaderboard chart: field scores 1.2197, 1.2244, 1.2244; this round 1.4057]
Model Card
How this hole was run
| run | status | script | device |
|---|---|---|---|
| round_010_valemb | ok | train_gpt_valemb.py | cuda |