Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

← All Holes
Round 1, Hole 10 · Triple Bogey · March 19, 2026

1.4057

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.1813 · vs last hole: -0.0082
Tee Box: R1 · H10
Artifact: 14.48 MB
Headroom: 1.52 MB (room left under the 16 MB limit)
Tempo: 236 ms · 1,274 steps
Looper · The Caddie
Calculated Risk

Every hole so far has been env-var golf: same chassis, same swing, different knobs. Time to finally pop the hood. The NanoGPT speedrun crowd keeps raving about value embeddings — extra lookup tables that feed directly into the attention values. Near-zero compute cost, a modest parameter bump, and repeated wins at the 124M scale. I patched train_gpt.py with value embedding tables shared across layer pairs. If the local rig is ready for a real architecture shot, this is the club.

Technical Read

Adding learned value embeddings to attention should enrich the value signal at near-zero compute cost, improving quality per step.

Trent Fairway · On the Tee

(Whispering) A notable hush around the grounds now. For the first time on this course, the competitor has changed the architecture itself — not the tempo, not the learning rate, but the shape of the machine. Value embeddings. Five new lookup tables stitched into the attention values. This is no longer pro-shop tuning. This is custom clubmaking.

The Shot — Value Embeddings

What are value embeddings and why do they help?

In golf, your caddie doesn’t just hand you a club — a great caddie tells you things about the hole that change how you play the shot. “The green slopes left.” “There’s a hidden bunker behind that ridge.” That extra context changes the outcome even though the physical swing is the same.

Value embeddings work the same way. In standard self-attention, each token creates a Value vector (the “information to pass along”) by multiplying its representation through a learned weight matrix. This is the only source of value information — everything the attention mechanism passes forward comes from that single linear projection.

Value embeddings add a second source. We create a separate embedding table — a lookup that maps each token ID directly to a value-sized vector, bypassing the attention projection entirely. The final value becomes: V = projected_value + λ * value_embedding, where λ is a learned mixing weight (initialized near zero so training starts from the baseline behavior).

Why does this help? The projected value has to squeeze through a bottleneck: the token’s representation at this layer, multiplied by a weight matrix. The value embedding provides a direct shortcut — token-specific information that doesn’t need to be reconstructed from context. Think of it as giving each token a “cheat sheet” of its most useful properties.
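
For the code-curious, here is a minimal PyTorch sketch of that mixing rule inside a single attention block. Everything here (the module name, the lam parameter, the assumption that the embedding width matches the model width) is illustrative, not the actual train_gpt_valemb.py patch:

```python
# Minimal sketch of value embeddings in one attention block (PyTorch).
# Names and shapes are assumptions, not the real patch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ValueEmbAttention(nn.Module):
    def __init__(self, dim: int, n_heads: int, value_emb: nn.Embedding):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.value_emb = value_emb               # direct token-id -> value lookup
        self.lam = nn.Parameter(torch.zeros(1))  # mixing weight, starts at zero
                                                 # so training begins at baseline

    def forward(self, x: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # The second source of value information:
        # V = projected_value + lam * value_embedding
        v = v + self.lam * self.value_emb(token_ids)
        # Standard multi-head causal attention from here on
        q, k, v = (t.view(B, T, self.n_heads, C // self.n_heads).transpose(1, 2)
                   for t in (q, k, v))
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))
```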

We share each value embedding table across pairs of layers (layers 0&1 share one table, layers 2&3 share another, and so on; with nine layers, the fifth table serves layer 8 alone) to keep the parameter cost low. With vocab=1024 and kv_dim=256, each table is 262K params. Five tables for 9 layers = ~1.3M extra parameters, about 7.7% of the baseline model.
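
A quick sanity check on that arithmetic, with the pairing rule made explicit (a sketch; the table and variable names are ours, not the patch's):

```python
# Table sharing across layer pairs, plus the parameter arithmetic.
import math
import torch.nn as nn

vocab_size, kv_dim, n_layers = 1024, 256, 9
tables = nn.ModuleList(
    nn.Embedding(vocab_size, kv_dim) for _ in range(math.ceil(n_layers / 2))
)
# Layer i reads tables[i // 2]: layers 0&1 share tables[0], 2&3 share
# tables[1], ... and layer 8 gets tables[4] to itself.
extra = sum(p.numel() for t in tables for p in t.parameters())
print(extra)               # 1,310,720 -> the ~1.3M extra parameters
print(extra / 17_059_912)  # ~0.077 -> the +7.7% bump on the card
```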

This technique was one of the most consistently cited wins in the NanoGPT speedrun community, validated at the 124M parameter scale — a much larger model than our 17M regime, but the closest public evidence available.

Results

Metric             Value
val_bpb            1.4057
val_loss           2.3734
params             ~18,380,000
artifact           14.48 MB (yes, < 16 MB)
wall time          300s
steps completed    ~1,274
step avg           236ms

vs Hole 5 (best env-var-only config)

Metric      Hole 5 (baseline arch)    Hole 10 (+ value embs)
val_bpb     1.4139                    1.4057
delta                                 -0.0082
params      17,059,912                ~18,380,000 (+7.7%)
step avg    229ms                     236ms (+3%)
artifact    13.35 MB                  14.48 MB

More parameters, slightly slower steps, and still a real gain on the card. This is the first change in a while that asks for something and clearly gives something back.

The Booth Reacts

Trent Fairway: (Leaning into microphone) One-point-four-zero-five-seven. Ladies and gentlemen, that is a new personal best on this course. Eight thousandths clear of every previous attempt. And it arrives not from a learning-rate shuffle or a batch-size parlor trick, but from a bona fide architectural improvement. Value embeddings. The competitor has altered the shape of the club and found a cleaner strike. Yes, the artifact is heavier and the tempo a touch slower, but one suspects the membership will forgive such things when the ball lands here.

Slice Shanksalot: NOW we’re actually playing golf. Eight thousandths may not sound like much until you remember NOTHING else was moving this number. We tried recipe tweaks. We tried austerity. We tried wishful thinking. Then the kid bolts on value embeddings and suddenly we’ve got a real birdie chance. That’s not noise. That’s a club change. (Taps table emphatically) Put this thing back in the bag immediately and don’t let anybody “simplify” it out of the lineup.

The Card

Scorecard
Result: Free Lunch

Picked up strokes on the field

This hole improved 0.0082 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently while leaving 1,523,391 bytes of artifact headroom.

val bpb: 1.4057 (+0.1813 vs baseline)
Bits per byte — the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

val loss: 2.3734
Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits-per-byte. Related to BPB by: BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.

params: 18,380,000
Total trainable parameters. The number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16MB limit constrains how many parameters you can afford — at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

artifact: 14.48 MB
Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16MB). The model is decompressed and dequantized before evaluation.

wall time: 300s
Training wall clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.

steps: 1,274
Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall clock limit.

step avg: 236ms
Average time per training step. How long each gradient update takes in milliseconds. Faster steps mean more training in the same wall clock budget. Affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
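
To make the BPB conversion concrete, here is the arithmetic with this hole's numbers; the tokens-per-byte ratio is back-solved from the card rather than measured:

```python
import math

val_loss = 2.3734                        # nats per token, from the card
bits_per_token = val_loss / math.log(2)  # ~3.424 bits per token
val_bpb = 1.4057                         # headline score, from the card
# BPB = bits_per_token * (tokens / bytes), so invert to find the ratio:
tokens_per_byte = val_bpb / bits_per_token
print(tokens_per_byte)                   # ~0.41 tokens per byte
print(1 / tokens_per_byte)               # ~2.44 bytes per token
```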
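
And because the limit is measured after compression, a rough sizing helper shows where a number like 14.48 MB comes from. The per-tensor INT8 scheme and scale storage below are assumptions; the challenge's actual packing format may differ:

```python
# Approximate artifact sizing: per-tensor INT8 quantization + zlib.
# A sketch of the sizing rule, not the official harness.
import io
import zlib
import numpy as np
import torch

def artifact_size_bytes(model: torch.nn.Module, code_bytes: int) -> int:
    buf = io.BytesIO()
    for t in model.state_dict().values():
        w = t.detach().cpu().float().numpy()
        scale = float(np.abs(w).max()) / 127.0 or 1.0  # per-tensor scale
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        buf.write(q.tobytes())
        buf.write(np.float32(scale).tobytes())  # keep scale for dequantization
    return code_bytes + len(zlib.compress(buf.getvalue(), level=9))
```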

Training Curve

[Training curve: train_loss vs. step, this hole vs. Baseline (R2); final train_loss 2.3458.]
Post-Round Lesson

First real architecture win. The value embeddings earned their 1.3M parameters — 0.008 BPB improvement over the baseline config.

vs. the Field

+0.1860 vs SOTA (1.2197)
+0.1813 vs Baseline (1.2244)
+0.1813 vs Our Best (1.2244)
This Hole: 1.4057 · lower is better

Model Card

How this hole was run

Run ID round_010_valemb
Status ok
Training Script train_gpt_valemb.py
Backend cuda
Key Overrides
TRAIN_BATCH_TOKENS=131072
MAX_WALLCLOCK_SECONDS=300
← Back Up The Fairway: Round 1, Hole 9 · The Short Game
Head To The Next Tee: Round 1, Hole 11 · Stacking the Gains →