Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

Round 1, Hole 13 · Bogey · March 19, 2026

1.3414

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.1170 · vs last hole: +0.0020
Tee Box R1 · H13
Artifact 16.20 MB
Headroom 0.00 MB (room left under the 16 MB cap)
Tempo 235 ms · 2,552 steps
Looper · The Caddie
Safe Tweaks

Hole 12 found gold but the suitcase was too big. The artifact hit 16.7MB — 700KB over the 16MB limit. The value embedding tables are the culprit: five tables at 262K params each. Let's try three tables instead, sharing across layer triplets instead of pairs. That cuts ~500KB from the artifact. If the quality holds, we're close to legal.

Technical Read

Reducing from 5 value embedding tables to 3 should shrink the artifact enough to fit under 16MB while preserving most of the quality.

Trent Fairway · On the Tee

(Whispering) The competitor returns to the tee with a lighter bag. Three value embedding tables where there were five. The question is not whether the quality will hold — one rather suspects it will — but whether the arithmetic of compression will finally cooperate.

The Shot — Fewer Value Embedding Tables

How much sharing can value embeddings tolerate?

In golf, you can share a caddie between two players in a casual round. Three players sharing one caddie is a stretch — the advice gets thinner, the reads get slower. But a great caddie can still help three players better than no caddie at all.

Value embeddings face the same sharing trade-off. In Hole 10, we used 5 tables shared across layer pairs (layers 0-1, 2-3, 4-5, 6-7, 8). Each pair got its own dedicated value embedding. Now we’re trying 3 tables shared across triplets (layers 0-2, 3-5, 6-8). Each table serves more layers, which means the embeddings can’t specialize as much for each layer’s specific needs.
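For concreteness, here's a minimal sketch of how that grouping might be wired up. Everything here is illustrative, not the competitor's actual code: the class name, the forward signature, and the (vocab_size, dim) shapes are assumptions (512 × 512 happens to equal the stated 262,144 params per table, but the post never gives the real shapes).

```python
import torch.nn as nn

class SharedValueEmbeddings(nn.Module):
    """Hypothetical sketch: one value-embedding table per group of adjacent layers."""

    def __init__(self, n_layers=9, group_size=3, vocab_size=512, dim=512):
        super().__init__()
        n_tables = -(-n_layers // group_size)  # ceil: 9 layers / 3 -> 3 tables
        self.group_size = group_size
        self.tables = nn.ModuleList(
            nn.Embedding(vocab_size, dim) for _ in range(n_tables)
        )

    def forward(self, layer_idx, token_ids):
        # group_size=2 reproduces Hole 10's pairing (0-1, 2-3, 4-5, 6-7, 8);
        # group_size=3 gives this hole's triplets (0-2, 3-5, 6-8).
        table_idx = min(layer_idx // self.group_size, len(self.tables) - 1)
        return self.tables[table_idx](token_ids)
```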

The key question: were the 5 tables actually specializing, or were some of them learning redundant information? If adjacent layers want similar value embeddings anyway (which is plausible — nearby layers in a transformer tend to capture similar levels of abstraction), then sharing across triplets costs very little quality while saving ~500KB in the compressed artifact.

The savings come from having 3 × 262,144 = 786K params instead of 5 × 262,144 = 1.3M params. At INT8 + zlib, that’s roughly 500KB of compressed artifact size — enough to potentially bring us under the 16MB competition limit.
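A back-of-the-envelope version of that arithmetic; the zlib ratio here is an assumption (INT8 weights compress only slightly), not a number measured from this run:

```python
table_params = 262_144
params_saved = (5 - 3) * table_params      # 524,288 params dropped
int8_bytes = params_saved                  # INT8 stores one byte per param
zlib_ratio = 0.95                          # ASSUMED compression ratio, not measured
print(f"~{int8_bytes * zlib_ratio / 1e3:.0f} KB saved")  # ~498 KB, i.e. "roughly 500KB"
```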

Results

Metric | Value
val_bpb | 1.3414
val_loss | 2.2649
params | ~17,850,000
artifact | 16.20 MB (STILL over 16MB by 254KB)
wall time | 600s
steps completed | 2,552
step avg | 235ms

Value Embedding Table Count Comparison (10-min runs)

Tables | val_bpb | Artifact | Under 16MB?
5 (Hole 12) | 1.3394 | 16.72 MB | No (720KB over)
3 (Hole 13) | 1.3414 | 16.20 MB | No (254KB over)
2 (next) | ??? | ~15.9 MB? | Hopefully

Quality essentially identical (0.002 BPB difference is noise). But still over budget. Need one more trim.
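The "~15.9 MB?" projection for that next trim follows from the same assumed numbers as above: drop one more 262,144-param table from the 16.20 MB artifact.

```python
current_mb = 16.20                      # Hole 13 artifact
per_table_mb = 262_144 * 0.95 / 1e6    # ~0.25 MB per table (ASSUMED zlib ratio)
print(f"2-table projection: ~{current_mb - per_table_mb:.2f} MB")  # ~15.95 MB
```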

The Booth Reacts

Trent: One-point-three-four-one-four. (Nods approvingly) Virtually indistinguishable from the five-table version. The two discarded tables were, as one suspected, largely ornamental. However. (Adjusts glasses) Sixteen-point-two megabytes. Still over the line. Two hundred and fifty-four kilobytes over, to be precise. One more trim, one imagines, and we shall finally be within the ropes.

Slice: Two tables doing NOTHING and we were carrying them around like dead weight! Classic over-packing. But look — we’re SO close. 254KB. That’s like being 254 yards from the green on a par 5. One more good shot and we’re on the dance floor. Drop to two tables. If the quality holds again — and I bet it does — we’ve got a legal artifact AND a 1.34 BPB. That’s a card I’d sign.

The Card

Scorecard
Result: Encouraging Miss

Dropped a shot versus the last hole

This hole lost 0.0020 on the compression score versus the previous stop. Lower is better here, so that's a small step backward, and the run still leaves 0 bytes of artifact headroom under the 16 MB cap.

1.3414 (+0.1170) · val bpb — Bits per byte, the headline score. How many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.
2.2649 · val loss — Validation cross-entropy loss. The model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits-per-byte. Related to BPB by: BPB = (val_loss / ln 2) × (tokens / bytes). Lower is better.
~17,850,000 · params — Total trainable parameters. The number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16MB limit constrains how many parameters you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.
16.20 MB · artifact — Compressed model + code size. The total submission size: your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16MB). The model is decompressed and dequantized before evaluation.
600s · wall time — Training wall-clock time. Real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.
2,552 · steps — Training steps completed. Each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall-clock limit.
235ms · step avg — Average time per training step. How long each gradient update takes in milliseconds. Faster steps mean more training in the same wall-clock budget. Affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
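The card's conversion formula in code, for anyone keeping score at home. The tokens-per-byte ratio below is back-solved from this hole's numbers; it is not a value the card reports.

```python
import math

val_loss = 2.2649         # nats per token, from this hole
tokens_per_byte = 0.4105  # ASSUMED: back-solved, not a reported number

bpb = (val_loss / math.log(2)) * tokens_per_byte
print(f"val_bpb = {bpb:.4f}")  # ~1.3413, in line with the card's 1.3414
```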

Training Curve

[Training curve: train_loss vs. step, this hole vs. Baseline (R2); final train_loss 2.3897]
Post-Round Lesson

3 tables perform the same as 5 — the extra two weren't pulling their weight. But still 254KB over budget. Need to trim further.

vs. the Field

+0.1217 vs SOTA (1.2197)
+0.1170 vs Baseline (1.2244)
+0.1170 vs Our Best (1.2244)
[Field chart: SOTA 1.2197 · Baseline 1.2244 · Our Best 1.2244 · This Hole 1.3414; lower is better]

Model Card

How this hole was run

Run ID round_013_valemb_slim
Status ok
Training Script train_gpt_valemb_slim.py
Backend cuda
Key Overrides
TRAIN_BATCH_TOKENS=131072
MAX_WALLCLOCK_SECONDS=600
← Back Up The Fairway: Round 1, Hole 12 · Finishing the Swing
Head To The Next Tee: Round 1, Hole 14 · Two Tables, Same Story →