Compression score: 1.4079
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
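For the curious, converting a token-level loss into bits per byte is plain arithmetic. A minimal sketch, assuming the score is mean cross-entropy in nats converted to bits and normalized by raw text bytes; the ~2.44 bytes-per-token ratio here is back-solved from this hole's val_loss (2.3772) and val_bpb (1.4079), not something the card states:

```python
import math

def bits_per_byte(nats_per_token, bytes_per_token):
    # bits/token = nats/token / ln(2); divide by bytes/token -> bits/byte
    return nats_per_token / (math.log(2) * bytes_per_token)

# Illustrative: a tokenizer averaging ~2.44 bytes/token turns this hole's
# val_loss into roughly the reported bpb.
print(round(bits_per_byte(2.3772, 2.436), 4))  # → 1.4079
```

The same loss scores better in bpb the more raw bytes each token covers, which is why the challenge normalizes by bytes rather than tokens.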
Combining value embeddings (Hole 10 win) with KV=2 (Hole 7 free lunch) should stack both gains — better quality plus faster steps.
Looper’s Pick
We’ve got two clubs the caddie likes: value embeddings from Hole 10 and KV=2 from Hole 7. One buys quality, the other buys tempo. In a perfect world, we stack them and stroll into the clubhouse with both gains at once. In the real world, architecture tweaks love to get precious about each other. Let’s find out which version of golf this is.
The Shot — Combining Architectural Wins
Why wouldn't two improvements simply add up?
In golf, a new driver and a new putting grip might each save you a stroke independently. But they don’t necessarily save you two strokes together — the driver change might alter your approach angles, making the putting improvement less relevant on the shots you actually face.
Architectural changes in neural networks interact the same way. Value embeddings add extra information to the attention value signal. Reducing KV heads from 4 to 2 halves the dimensionality of that value signal — from 256 to 128 dimensions. So the value embeddings that worked at 256 dimensions now have to squeeze into 128 dimensions, which means less room for the extra information they carry.
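A toy sketch of the squeeze, in NumPy rather than the actual `train_gpt_valemb.py` internals; names and dimensions are illustrative, but the shape arithmetic is the point: the value-embedding table must match the width of the V signal, so halving KV heads halves the space it can write into.

```python
import numpy as np

head_dim, seq_len, vocab = 64, 8, 100
rng = np.random.default_rng(0)

for n_kv_heads in (4, 2):
    v_dim = n_kv_heads * head_dim                 # 256 at KV=4, 128 at KV=2
    value_emb = rng.normal(size=(vocab, v_dim))   # learned table (random here)
    tokens = rng.integers(0, vocab, size=seq_len)
    v = rng.normal(size=(seq_len, v_dim))         # ordinary value projection
    v = v + value_emb[tokens]                     # inject extra info into V
    print(n_kv_heads, v.shape[-1])                # → 4 256, then 2 128
```

The extra information hasn't vanished at KV=2; it simply shares half as many dimensions with the ordinary value projection.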
On the speed side, the combination does stack: 221ms per step vs 236ms with just value embeddings (6% faster, ~85 more steps in 5 minutes). But the quality loss from the smaller value space partially erodes the value embedding gains.
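The tempo arithmetic checks out with integer division over the 5-minute budget (a back-of-the-envelope sketch, not the harness's exact accounting):

```python
BUDGET_MS = 300_000                      # 5-minute wall clock
steps_combined = BUDGET_MS // 221        # value embs + KV=2
steps_valemb   = BUDGET_MS // 236        # value embs alone
print(steps_combined, steps_valemb, steps_combined - steps_valemb)
# → 1357 1271 86
```

That predicted 1,357 steps sits right next to the 1,359 the run actually completed.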
This is a common pattern in model optimization: improvements are rarely perfectly additive. Each change reshapes the loss landscape, and the next change operates on a different surface than the one it was tested on. The discipline is to test combinations empirically rather than assuming they’ll stack, and to keep the individual changes available for mixing differently later.
On the Tee
(Whispering) A delicate bit of overconfidence, perhaps. The competitor is combining two earlier wins in one swing: value embeddings from Hole 10 and reduced KV heads from Hole 7. In golfing terms, it is rather like changing both your club and your grip after a birdie. The ingredients have pedigree. Whether the recipe does is another matter entirely.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4079 |
| val_loss | 2.3772 |
| params | ~17,200,000 |
| artifact | 14.45 MB (yes < 16MB) |
| wall time | 300s |
| steps completed | 1,359 |
| step avg | 221ms |
The Stack
| Hole | Config | BPB | Step avg |
|---|---|---|---|
| 5 | Baseline arch | 1.4139 | 229ms |
| 7 | + KV=2 | 1.4140 | 218ms |
| 10 | + Value embs (KV=4) | 1.4057 | 236ms |
| 11 | + Value embs + KV=2 | 1.4079 | 221ms |
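The table makes the interaction easy to quantify. If the two deltas stacked perfectly on the Hole 5 baseline, the combined score would land near 1.4058; a quick check of what stacking actually cost:

```python
baseline = 1.4139   # Hole 5
kv2      = 1.4140   # Hole 7: + KV=2
valemb   = 1.4057   # Hole 10: + value embs (KV=4)
combined = 1.4079   # Hole 11: both at once

d_kv2    = kv2 - baseline      # each change's delta applied alone
d_valemb = valemb - baseline

expected    = baseline + d_kv2 + d_valemb   # perfectly additive prediction
interaction = combined - expected           # what stacking actually cost
print(round(expected, 4), round(interaction, 4))  # → 1.4058 0.0021
```

An interaction penalty of 0.0021 bpb: the combination keeps only about a quarter of the value-embedding gain.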
This is the sort of result every experiment notebook needs: not a disaster, not a triumph, but a firm answer. Value embeddings at KV=4 remains the better club. KV=2 bought back some pace, but it squeezed the very mechanism that made Hole 10 special.
The Booth Reacts
Trent: One-point-four-zero-seven-nine. (Measured nod) Respectable — second-best on our card, in fact — but not the glorious stack one had hoped for. The value embeddings, it seems, prefer the wider canvas of four KV heads. Compress the value space and the flourish loses some of its force. A useful answer, though. Good clubs are not always good doubles partners.
Slice: So let me get this straight. KV=2 by itself: fine. Value embeddings by themselves: best ball on the course. Put them together and suddenly everybody forgets how to coordinate? (Throws hands up) Classic neural-network behavior. Fine. Message received. Keep the value embeddings, give them the full four KV heads, and stop trying to make every good idea marry every other good idea on the first date.
The Card
Dropped a shot versus the last hole
This hole gave back 0.0022 on the compression score versus the previous stop (lower is better). The artifact still leaves 1,550,686 bytes of headroom under the 16 MB cap.
Training Curve
Partial stack. KV=2 shrinks the value-embedding dimension too, blunting their effect; the speed gain (221ms vs 236ms per step) didn't fully compensate.
vs. the Field
[Leaderboard chart: top of field at 1.2197, 1.2244, 1.2244; this round at 1.4079]
Model Card
How this hole was run
| run | status | script | device |
|---|---|---|---|
| round_011_valemb_kv2 | ok | train_gpt_valemb.py | cuda |