Compression score: 1.4141 bits per byte
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
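The conversion from a training loss to this score can be sketched as follows (a minimal illustration, assuming the model reports mean cross-entropy in nats per token and the byte length of the eval text is known; the challenge's actual scoring script may differ):

```python
import math

def bits_per_byte(mean_ce_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Total bits the model needs to encode the text, divided by its byte length.

    mean_ce_nats: average cross-entropy per token, in nats.
    n_tokens / n_bytes: how many tokens the tokenizer produced for n_bytes of text.
    """
    total_bits = mean_ce_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes
```

Note the token-to-byte ratio matters: a model scored per token can look better or worse per byte depending on how many bytes each token covers on average.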
Keeping tied embeddings in fp16 should shrink the quantization gap, but only if the model is already well-converged.
Looper’s Pick
The leaderboard leader, PR #42, got their biggest win from a simple trick: keep the embedding weights in fp16 during export instead of crushing them to INT8 like everything else. Kills the quantization gap from 0.007 BPB down to almost nothing. I patched train_gpt.py with a seven-line change to the quantizer. Costs about 500KB in artifact size — well within budget. Let’s see if it moves the needle.
The Shot — Killing the Quantization Gap
What is the "quantization gap" and why does keeping embeddings in fp16 help?
Think of it this way: you paint a masterpiece in oil on canvas, then someone asks you to reproduce it using only 256 crayons. You can get close — the broad strokes will be right — but the subtle gradients, the delicate shading, the precise color of a sunset? Those get crushed into the nearest crayon color. That’s quantization.
After training, the model’s weights are stored in high-precision floating point (32 or 16 bits per number). For the compressed artifact, we map every weight to the nearest of 256 possible 8-bit integer values instead of billions. This makes the file much smaller but introduces small rounding errors. The difference in model quality between the original weights and the quantized weights is called the “quantization gap.”
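The rounding step can be sketched as symmetric per-tensor INT8 quantization (a minimal illustration; the actual quantizer in train_gpt.py may use a different scheme, e.g. per-channel scales):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map a float tensor to int8 with one shared scale (symmetric, per-tensor)."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to +/-127
    q = np.round(w / scale).astype(np.int8)  # round to the nearest crayon color
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()  # per-weight error is bounded by scale / 2
```

The per-weight error bound of half a quantization step is what "averages out" over a million-weight matrix but never cancels inside a single looked-up embedding row.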
For most of the model — the attention and MLP matrices that make up 96.8% of parameters — INT8 quantization works fine. These matrices have millions of weights, and the errors average out across them.
But the embedding table is special. It’s a lookup table: each of the 1,024 tokens in our vocabulary maps to a 512-dimensional vector. When the model sees a token, it looks up that exact vector. There’s no averaging, no error cancellation — if the embedding for the word “the” is slightly wrong, it’s wrong for every occurrence of “the” in every document.
PR #42 on the Parameter Golf leaderboard discovered that keeping this one tensor in fp16 (16-bit float, 65,536 possible values instead of 256) nearly eliminates the quantization gap: from ~0.007 BPB down to ~0.0005 BPB. The cost is about 500KB of extra artifact size (1MB for the fp16 embedding vs ~525KB for INT8 + scales). With our artifact at 13.4MB, we have plenty of headroom.
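The export-side change can be sketched like this (a hedged sketch, not PR #42's actual seven-line patch: the tensor name "embed.weight", the per-tensor scale layout, and the helper are assumptions for illustration):

```python
import numpy as np

# Assumption: the embedding table is stored under this key in the state dict.
KEEP_FP16 = {"embed.weight"}

def export_state_dict(state: dict) -> dict:
    """Quantize everything to INT8 except tensors listed in KEEP_FP16."""
    artifact = {}
    for name, w in state.items():
        if name in KEEP_FP16:
            # 1024 x 512 table in fp16: 1,048,576 bytes (~1 MB),
            # vs ~525 KB for INT8 values plus scales.
            artifact[name] = w.astype(np.float16)
        else:
            scale = np.abs(w).max() / 127.0
            artifact[name] = (np.round(w / scale).astype(np.int8),
                              np.float32(scale))
    return artifact
```

The design point is that this is a per-tensor policy decision at export time, so it touches nothing about training: the same checkpoint can be exported either way.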
The catch for our experiment: this fix only matters when the model is well-converged and the quantization noise is the dominant error. At 1,300 steps on our L40S, the model is still significantly undertrained — the training noise dwarfs the quantization noise. We expect this change to show its value on 8xH100 with 13,000+ steps, not here.
On the Tee
(Whispering) A surgical approach today. The competitor has modified the export pipeline — preserving the embedding table at higher precision while the rest of the model endures the usual INT8 compression. It’s the golfing equivalent of using a premium ball while keeping the rental clubs. The question is whether the ball matters when you’re still learning the course.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4141 |
| val_loss | 2.3876 |
| params | 17,059,912 |
| artifact | 13.68 MB (under the 16 MB limit) |
| wall time | 300s |
| steps completed | 1,300 |
Hole 5 vs Hole 6 (fp16 embed diff)
| Metric | Hole 5 (INT8 embed) | Hole 6 (fp16 embed) |
|---|---|---|
| val_bpb | 1.4139 | 1.4141 |
| artifact | 13.35 MB | 13.68 MB |
| quant gap effect | — | ~0 (model too undertrained) |
Verdict: Identical within noise at this step count. The fix costs 330KB extra artifact but will pay off at convergence.
The Booth Reacts
Trent: And the result is… essentially identical. One-point-four-one-four-one versus one-point-four-one-three-nine. A difference of two ten-thousandths. Now, the uninitiated viewer might see this as a wasted hole. But I would gently suggest otherwise. (Adjusting glasses) This is a club that was never designed for this particular hole. The fp16 embedding fix addresses quantization error — which is rather like polishing your shoes when you haven’t yet learned to walk. The shoes will matter later. Today, the walk is the thing.
Slice: Look, I KNOW this fix works — the PR #42 guys proved it on H100s. But we’re running 1,300 steps! The model is still trying to figure out what LANGUAGE is at this point, and we’re worried about fp16 versus INT8 on the embedding table? That’s like arguing about your putting grip when you’re 200 yards from the green. The fix is in the bag. It’ll come out when we need it. For now, can we PLEASE try something that actually moves the loss curve? I’ve been watching paint dry over here.
The Card
Played it straight down the middle
This hole gave back 0.0002 on the compression score versus the previous stop (lower is better, so this is a hair worse), while leaving 2,321,228 bytes of artifact headroom.
Training Curve
Correct idea, wrong timescale. Export fixes matter most after training noise stops dominating.
vs. the Field
Leaderboard BPB: 1.2197, 1.2244, 1.2244 (field) vs 1.4141 (this hole).
Model Card
How this hole was run
| run | status | device |
|---|---|---|
| round_006_fp16embed | ok | cuda |