Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

Round 1, Hole 6 · Triple Bogey · March 18, 2026

1.4141

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.1897 · vs last hole: +0.0002
Tee Box: R1 · H6
Artifact: 13.68 MB
Headroom: 2.32 MB (room left under the 16 MB limit)
Tempo: 231 ms/step · 1,300 steps
Looper, The Caddie · Safe Tweaks

The leaderboard leader, PR #42, got their biggest win from a simple trick: keep the embedding weights in fp16 during export instead of crushing them to INT8 like everything else. Kills the quantization gap from 0.007 BPB down to almost nothing. I patched train_gpt.py with a seven-line change to the quantizer. Costs about 500KB in artifact size — well within budget. Let's see if it moves the needle.

Technical Read

Keeping tied embeddings in fp16 should shrink the quantization gap, but only if the model is already well-converged.

Trent Fairway, On the Tee

(Whispering) A surgical approach today. The competitor has modified the export pipeline — preserving the embedding table at higher precision while the rest of the model endures the usual INT8 compression. It's the golfing equivalent of using a premium ball while keeping the rental clubs. The question is whether the ball matters when you're still learning the course.


The Shot — Killing the Quantization Gap

What is the "quantization gap" and why does keeping embeddings in fp16 help?

Think of it this way: you paint a masterpiece in oil on canvas, then someone asks you to reproduce it using only 256 crayons. You can get close — the broad strokes will be right — but the subtle gradients, the delicate shading, the precise color of a sunset? Those get crushed into the nearest crayon color. That’s quantization.

After training, the model’s weights are stored in high-precision floating point (32 or 16 bits per number). For the compressed artifact, we round every weight to the nearest of 256 8-bit integer levels — 256 possible values instead of billions. This makes the file much smaller but introduces small rounding errors. The difference in model quality between the original weights and the quantized weights is called the “quantization gap.”
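The round trip can be sketched in a few lines of NumPy (a generic symmetric per-tensor INT8 scheme for illustration; the actual quantizer in train_gpt.py may differ in its details):

```python
import numpy as np

def quantize_int8(w):
    """Map floats onto 255 signed integer levels with one shared step size."""
    scale = np.abs(w).max() / 127.0                      # quantization step
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Reconstruct the 'crayon' version of the weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)   # stand-in weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Every weight lands within half a quantization step of its original value;
# that residual is the raw material of the quantization gap.
max_err = np.abs(w_hat - w).max()
```

Storing `q` (one byte per weight) plus a single `scale` is what shrinks the artifact; the price is that `max_err` can be as large as half a quantization step.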

For most of the model — the attention and MLP matrices that make up 96.8% of parameters — INT8 quantization works fine. These matrices have millions of weights, and the errors average out across them.

But the embedding table is special. It’s a lookup table: each of the 1,024 tokens in our vocabulary maps to a 512-dimensional vector. When the model sees a token, it looks up that exact vector. There’s no averaging, no error cancellation — if the embedding for the word “the” is slightly wrong, it’s wrong for every occurrence of “the” in every document.
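A toy version of the failure mode (per-tensor scaling assumed for simplicity; the real quantizer may scale per row): a quiet embedding row gets rounded with a step size set by the loudest row, and because a lookup returns that row verbatim, the identical error is replayed at every occurrence of the token.

```python
import numpy as np

def int8_roundtrip(w):
    """Quantize to INT8 with one shared scale, then dequantize."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

# Toy table: token 0 has a large-magnitude vector, token 1 a small one
# (its entries sit barely above half a quantization step).
emb = np.zeros((2, 8), dtype=np.float32)
emb[0] = 1.0
emb[1] = 0.004

deq = int8_roundtrip(emb)
rel_err_loud = np.abs(deq[0] - emb[0]).max() / 1.0     # essentially exact
rel_err_quiet = np.abs(deq[1] - emb[1]).max() / 0.004  # badly distorted
```

There is no neighboring weight for this error to cancel against: whenever token 1 appears, the model starts from the same distorted vector, which is why the embedding table benefits disproportionately from higher precision.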

PR #42 on the Parameter Golf leaderboard discovered that keeping this one tensor in fp16 (16-bit float, 65,536 possible values instead of 256) nearly eliminates the quantization gap: from ~0.007 BPB down to ~0.0005 BPB. The cost is about 500KB of extra artifact size (1MB for the fp16 embedding vs ~525KB for INT8 + scales). With our artifact at 13.4MB, we have plenty of headroom.
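A sketch of what the export change amounts to, with hypothetical tensor names and a generic INT8 scheme standing in for the seven-line patch to train_gpt.py: everything is quantized except the tensors on an fp16 allow-list, and the size arithmetic for the 1,024 × 512 table matches the paragraph above.

```python
import numpy as np

def export_weights(state_dict, fp16_names=("embed.weight",)):
    """INT8-quantize all tensors except those kept in fp16 (names are illustrative)."""
    artifact = {}
    for name, w in state_dict.items():
        if name in fp16_names:
            artifact[name] = w.astype(np.float16)   # skip the quantizer entirely
        else:
            scale = np.abs(w).max() / 127.0
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            artifact[name] = (q, np.float32(scale))
    return artifact

# Size cost of the swap for a 1,024-token, 512-dim embedding table:
vocab, dim = 1024, 512
fp16_bytes = vocab * dim * 2              # 1,048,576 bytes (~1 MB)
int8_bytes = vocab * dim + vocab * 4      # data + per-row fp32 scales (~528 KB)
extra_bytes = fp16_bytes - int8_bytes     # ~520 KB, the "about 500KB" cost
```

With over 2 MB of headroom under the 16 MB cap, half a megabyte is cheap insurance.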

The catch for our experiment: this fix only matters when the model is well-converged and the quantization noise is the dominant error. At 1,300 steps on our L40S, the model is still significantly undertrained — the training noise dwarfs the quantization noise. We expect this change to show its value on 8xH100 with 13,000+ steps, not here.


Results

Metric             Value
val_bpb            1.4141
val_loss           2.3876
params             17,059,912
artifact           13.68 MB (yes, under 16 MB)
wall time          300 s
steps completed    1,300

Hole 5 vs Hole 6 (fp16 embed diff)

Metric             Hole 5 (INT8 embed)    Hole 6 (fp16 embed)
val_bpb            1.4139                 1.4141
artifact           13.35 MB               13.68 MB
quant gap effect   ~0 (model too undertrained)

Verdict: Identical within noise at this step count. The fix costs 330KB of extra artifact size but should pay off at convergence.

The Booth Reacts

Trent: And the result is… essentially identical. One-point-four-one-four-one versus one-point-four-one-three-nine. A difference of two ten-thousandths. Now, the uninitiated viewer might see this as a wasted hole. But I would gently suggest otherwise. (Adjusting glasses) This is a club that was never designed for this particular hole. The fp16 embedding fix addresses quantization error — which is rather like polishing your shoes when you haven’t yet learned to walk. The shoes will matter later. Today, the walk is the thing.

Slice: Look, I KNOW this fix works — the PR #42 guys proved it on H100s. But we’re running 1,300 steps! The model is still trying to figure out what LANGUAGE is at this point, and we’re worried about fp16 versus INT8 on the embedding table? That’s like arguing about your putting grip when you’re 200 yards from the green. The fix is in the bag. It’ll come out when we need it. For now, can we PLEASE try something that actually moves the loss curve? I’ve been watching paint dry over here.


The Card

Scorecard
Result: Encouraging Miss

Played it straight down the middle

This hole gave back 0.0002 on the compression score versus the previous stop — lower is better here, so that is a slight step backward — while leaving 2,321,228 bytes of artifact headroom under the 16 MB cap.

1.4141 (+0.1897) · val bpb — Bits per byte, the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

2.3876 · val loss — Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits-per-byte, related by BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.

17,059,912 · params — Total trainable parameters. The number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16MB limit constrains how many parameters you can afford — at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

13.68 MB · artifact — Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16MB). The model is decompressed and dequantized before evaluation.

300s · wall time — Training wall clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.

1,300 · steps — Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall clock limit.

231ms · step avg — Average time per training step. How long each gradient update takes in milliseconds. Faster steps mean more training in the same wall clock budget. Affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
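The glossary's conversion can be sanity-checked against this hole's own numbers; the tokens-per-byte ratio isn't reported on the card, so the snippet solves for the ratio the two scores imply.

```python
import math

val_loss = 2.3876   # validation loss in nats per token (from the scorecard)
val_bpb = 1.4141    # bits per byte, the headline score

bits_per_token = val_loss / math.log(2)       # nats -> bits: ~3.445 bits/token
# BPB = bits_per_token * (tokens / bytes), so invert for the implied ratio:
bytes_per_token = bits_per_token / val_bpb    # ~2.44 bytes per token
```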

Training Curve

[Chart: train_loss vs. training step, this hole vs. the baseline (R2); highlighted train_loss value: 2.3685.]
Post-Round Lesson

Correct idea, wrong timescale. Export fixes matter most after training noise stops dominating.

vs. the Field

+0.1944 vs SOTA (1.2197)
+0.1897 vs Baseline (1.2244)
+0.1897 vs Our Best (1.2244)


Model Card

How this hole was run

Run ID round_006_fp16embed
Status ok
Backend cuda
Key Overrides
TRAIN_BATCH_TOKENS=131072 · FP16_EMBED=true · MAX_WALLCLOCK_SECONDS=300
← Back Up The Fairway: Round 1, Hole 5, "More Swings, Less Club" · Head To The Next Tee: Round 1, Hole 7, "Traveling Light" →