Compression score: 1.4141 bits per byte
What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
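The conversion from a training loss to this score can be sketched as follows (a minimal illustration, assuming the model reports mean cross-entropy in nats per token and the byte length of the eval text is known; the challenge's actual scoring script may differ):

```python
import math

def bits_per_byte(mean_ce_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Total bits the model needs to encode the text, divided by its byte length.

    mean_ce_nats: average cross-entropy per token, in nats.
    n_tokens / n_bytes: how many tokens the tokenizer produced for n_bytes of text.
    """
    total_bits = mean_ce_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes
```

Note the token-to-byte ratio matters: a model scored per token can look better or worse per byte depending on how many bytes each token covers on average.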
Keeping tied embeddings in fp16 should shrink the quantization gap, but only if the model is already well-converged.
Looper’s Pick
The leaderboard leader, PR #42, got their biggest win from a simple trick: keep the embedding weights in fp16 during export instead of crushing them to INT8 like everything else. Kills the quantization gap from 0.007 BPB down to almost nothing. I patched train_gpt.py with a seven-line change to the quantizer. Costs about 500KB in artifact size — well within budget. Let’s see if it moves the needle.
The Shot — Killing the Quantization Gap
What is the "quantization gap" and why does keeping embeddings in fp16 help?
Think of it this way: you paint a masterpiece in oil on canvas, then someone asks you to reproduce it using only 256 crayons. You can get close — the broad strokes will be right — but the subtle gradients, the delicate shading, the precise color of a sunset? Those get crushed into the nearest crayon color. That’s quantization.
After training, the model’s weights are stored in high-precision floating point (32 or 16 bits per number). For the compressed artifact, we map every weight to the nearest of 256 possible 8-bit integer values instead of billions. This makes the file much smaller but introduces small rounding errors. The difference in model quality between the original weights and the quantized weights is called the “quantization gap.”
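The rounding step can be sketched as symmetric per-tensor INT8 quantization (a minimal illustration; the actual quantizer in train_gpt.py may use a different scheme, e.g. per-channel scales):

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Map a float tensor to int8 with one shared scale (symmetric, per-tensor)."""
    scale = np.abs(w).max() / 127.0          # largest magnitude maps to +/-127
    q = np.round(w / scale).astype(np.int8)  # round to the nearest crayon color
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(dequantize(q, s) - w).max()  # per-weight error is bounded by scale / 2
```

The per-weight error bound of half a quantization step is what "averages out" over a million-weight matrix but never cancels inside a single looked-up embedding row.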
For most of the model — the attention and MLP matrices that make up 96.8% of parameters — INT8 quantization works fine. These matrices have millions of weights, and the errors average out across them.
But the embedding table is special. It’s a lookup table: each of the 1,024 tokens in our vocabulary maps to a 512-dimensional vector. When the model sees a token, it looks up that exact vector. There’s no averaging, no error cancellation — if the embedding for the word “the” is slightly wrong, it’s wrong for every occurrence of “the” in every document.
PR #42 on the Parameter Golf leaderboard discovered that keeping this one tensor in fp16 (16-bit float, 65,536 possible values instead of 256) nearly eliminates the quantization gap: from ~0.007 BPB down to ~0.0005 BPB. The cost is about 500KB of extra artifact size (1MB for the fp16 embedding vs ~525KB for INT8 + scales). With our artifact at 13.4MB, we have plenty of headroom.
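The export-side change can be sketched like this (a hedged sketch, not PR #42's actual seven-line patch: the tensor name "embed.weight", the per-tensor scale layout, and the helper are assumptions for illustration):

```python
import numpy as np

# Assumption: the embedding table is stored under this key in the state dict.
KEEP_FP16 = {"embed.weight"}

def export_state_dict(state: dict) -> dict:
    """Quantize everything to INT8 except tensors listed in KEEP_FP16."""
    artifact = {}
    for name, w in state.items():
        if name in KEEP_FP16:
            # 1024 x 512 table in fp16: 1,048,576 bytes (~1 MB),
            # vs ~525 KB for INT8 values plus scales.
            artifact[name] = w.astype(np.float16)
        else:
            scale = np.abs(w).max() / 127.0
            artifact[name] = (np.round(w / scale).astype(np.int8),
                              np.float32(scale))
    return artifact
```

The design point is that this is a per-tensor policy decision at export time, so it touches nothing about training: the same checkpoint can be exported either way.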
The catch for our experiment: this fix only matters when the model is well-converged and the quantization noise is the dominant error. At 1,300 steps on our L40S, the model is still significantly undertrained — the training noise dwarfs the quantization noise. We expect this change to show its value on 8xH100 with 13,000+ steps, not here.
On the Tee
(Whispering) A surgical approach today. The competitor has modified the export pipeline — preserving the embedding table at higher precision while the rest of the model endures the usual INT8 compression. It’s the golfing equivalent of using a premium ball while keeping the rental clubs. The question is whether the ball matters when you’re still learning the course.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4141 |
| val_loss | 2.3876 |
| params | 17,059,912 |
| artifact | 13.68 MB (under the 16 MB limit) |
| wall time | 300s |
| steps completed | 1,300 |
Hole 5 vs Hole 6 (fp16 embed diff)
| Metric | Hole 5 (INT8 embed) | Hole 6 (fp16 embed) |
|---|---|---|
| val_bpb | 1.4139 | 1.4141 |
| artifact | 13.35 MB | 13.68 MB |
| quant gap effect | — | ~0 (model too undertrained) |
Verdict: Identical within noise at this step count. The fix costs 330KB extra artifact but will pay off at convergence.
The Booth Reacts
Trent: And the result is… essentially identical. One-point-four-one-four-one versus one-point-four-one-three-nine. A difference of two ten-thousandths. Now, the uninitiated viewer might see this as a wasted hole. But I would gently suggest otherwise. (Adjusting glasses) This is a club that was never designed for this particular hole. The fp16 embedding fix addresses quantization error — which is rather like polishing your shoes when you haven’t yet learned to walk. The shoes will matter later. Today, the walk is the thing.
Slice: Look, I KNOW this fix works — the PR #42 guys proved it on H100s. But we’re running 1,300 steps! The model is still trying to figure out what LANGUAGE is at this point, and we’re worried about fp16 versus INT8 on the embedding table? That’s like arguing about your putting grip when you’re 200 yards from the green. The fix is in the bag. It’ll come out when we need it. For now, can we PLEASE try something that actually moves the loss curve? I’ve been watching paint dry over here.
The Card
Played it straight down the middle
This hole gave back 0.0002 on the compression score versus the previous stop (lower is better, so this is a hair worse), while leaving 2,321,228 bytes of artifact headroom.
Training Curve
Correct idea, wrong timescale. Export fixes matter most after training noise stops dominating.
vs. the Field
Leaderboard BPB: 1.2197, 1.2244, 1.2244 (field) vs 1.4141 (this hole).
Model Card
How this hole was run
| run | status | device |
|---|---|---|
| round_006_fp16embed | ok | cuda |