Est. 2026 • Members Welcome

Gradient Descent Country Club

Members-only scorecards from the Parameter Golf circuit

← All Holes
Round 1, Hole 8 · Triple Bogey · March 19, 2026

1.4270

compression score

What this score means

Quick read before we head down the fairway.

Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.

vs baseline: +0.2026 · vs last hole: +0.0130
Tee Box: R1 · H8
Artifact: 10.25 MB
Headroom: 5.75 MB (room left under the 16 MB limit)
Tempo: 200 ms · 1,501 steps

Technical Read

A much smaller MLP might lose some per-step quality, but the extra throughput could still win under a short wallclock.


Looper’s Pick: Calculated Risk

The MLP eats 55% of all parameters at MLP_MULT=2. That’s over 9 million params just for the feed-forward layers. What if we halve it? MLP_MULT=1 cuts the MLP to 512→512 instead of 512→1024. Saves 4.7 million parameters and makes each step faster. The trade: less “thinking” capacity per layer. But we get 1500 steps instead of 1300. Let’s see if volume beats capacity.

The Shot — Halving the MLP

What does the MLP do, and what happens when you cut it in half?

In golf, your driver gets you distance and your irons get you accuracy. The attention mechanism is the driver — it scans the whole context and gathers relevant information. The MLP (feed-forward network) is the iron — it takes that gathered information and processes it, extracting meaning and storing knowledge.

Each transformer block has both: attention first, then MLP. The MLP works by expanding the representation to a wider hidden dimension (controlled by MLP_MULT), applying a nonlinearity (ReLU² in our case), then projecting back down. At MLP_MULT=2, a 512-dimensional input expands to 1024 dimensions internally. This expansion is where the model stores factual associations and performs its “thinking.”
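To make that concrete, here is a minimal PyTorch sketch of the feed-forward path described above. The 512-dimensional width, the MLP_MULT expansion, and the ReLU² nonlinearity come from this hole's setup; the class name, the bias-free linears, and everything else are illustrative assumptions, not the run's actual code.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Transformer MLP: expand to mlp_mult * d_model, apply ReLU^2, project back."""
    def __init__(self, d_model: int = 512, mlp_mult: int = 2):
        super().__init__()
        # mlp_mult=2: 512 -> 1024 -> 512. mlp_mult=1: 512 -> 512 -> 512 (no expansion).
        self.up = nn.Linear(d_model, mlp_mult * d_model, bias=False)
        self.down = nn.Linear(mlp_mult * d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ReLU squared, then project back down to the model width.
        return self.down(torch.relu(self.up(x)).square())
```

At mlp_mult=1 the hidden width equals the model width, which is exactly the capacity this hole gambles away for speed.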

The MLP accounts for 55% of all parameters in the baseline — over 9 million of 17 million total. By setting MLP_MULT=1 (no expansion), we slash this to ~4.7 million, saving 28% of total parameters. The model drops from 17M to 12.3M params, and each step gets faster (200ms vs 229ms), giving us more steps in the same wall time.
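Those counts check out by hand. A back-of-the-envelope sketch, assuming bias-free linears and nine transformer blocks (the block count isn't stated on this card; nine is our assumption, chosen because it makes the reported totals line up exactly):

```python
d_model = 512
n_blocks = 9  # assumption: not stated on the card, but consistent with the totals

def mlp_params(mlp_mult: int) -> int:
    # Per block: up-projection (d -> mult*d) plus down-projection (mult*d -> d), no biases.
    return n_blocks * 2 * d_model * (mlp_mult * d_model)

print(mlp_params(2))                  # 9,437,184  -> "over 9 million"
print(mlp_params(1))                  # 4,718,592  -> "~4.7 million"
print(mlp_params(2) - mlp_params(1))  # 4,718,592 = 17,059,912 - 12,341,320
```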

The risk is real, though. Research consistently shows that MLP capacity is where transformers store knowledge. Cutting it in half is like replacing your 7-iron with a pitching wedge — you can still hit the ball, but you’ve lost range. The question is whether the extra ~190 steps from the speed gain compensate for the reduced per-step quality.

On the Tee

(Whispering) A rather aggressive club selection from the caddie today. The MLP — the model’s primary knowledge store — has been halved. Twelve million parameters where there were seventeen. One notes the gallery exchanging concerned glances. But the caddie points to the clock: more steps, more swings, more chances.

Results

Metric           Value
val_bpb          1.4270
val_loss         2.4095
params           12,341,320
artifact         10.25 MB (under the 16 MB limit)
wall time        300s
steps completed  1,501
step avg         200ms
peak memory      2,474 MiB

vs Hole 5 (MLP_MULT=2)

Metric    Hole 5 (MLP=2)   Hole 8 (MLP=1)
val_bpb   1.4139           1.4270
params    17,059,912       12,341,320 (-28%)
steps     1,309            1,501 (+15%)
step avg  229ms            200ms (-13%)
artifact  13.35 MB         10.25 MB

Faster and smaller, but 0.013 BPB worse. The capacity loss outweighed the step gain.

The Booth Reacts

Trent: One-point-four-two-seven-zero. (Measured pause) That is… not an improvement. Thirteen thousandths worse than the full-MLP configuration from Hole 5. One observes that the model is now significantly smaller — ten megabytes, with ample room beneath the sixteen-megabyte ceiling — and faster per step. But the MLP, it would seem, earns its keep. Fifty-five percent of the parameters, and apparently, fifty-five percent of the model’s ability to make sense of language. A lesson in the value of iron play.

Slice: I TOLD you, you can’t just rip the engine out of a car and expect it to go faster because it’s lighter! OK, I didn’t actually tell you that, but I SHOULD have. Look — 1,501 steps and it STILL can’t match 1,309 steps with the full MLP. The MLP is doing WORK. Those parameters aren’t freeloaders like the KV heads. They’re the BACKBONE. (Pointing at camera) Respect the feed-forward network, people. It’s not glamorous, but it’s where the magic happens.


The Card

Result: Dead End

Dropped a shot versus the last hole

This hole lost 0.0130 on the compression score versus the previous stop. Lower is better here: it means the model predicts unseen text more efficiently while leaving 5,753,359 bytes of artifact headroom.

val bpb: 1.4270 (+0.2026). Bits per byte, the headline score: how many bits the model needs, on average, to predict each byte of unseen text. This is the challenge's primary metric. It's tokenizer-agnostic, so models with different vocabularies can be compared fairly. Lower is better. The baseline scores 1.2244.

val loss: 2.4095. Validation cross-entropy loss: the model's prediction error on held-out text, measured in nats (natural log units). This is the raw loss before converting to bits per byte. Related to BPB by: BPB = (val_loss / ln(2)) × (tokens / bytes). Lower is better.

params: 12,341,320. Total trainable parameters: the number of individual weight values in the model. More parameters generally means more capacity to learn, but also a larger artifact. The 16MB limit constrains how many parameters you can afford: at INT8, roughly 16 million; at ternary (1.58 bits), roughly 80 million.

artifact: 10.25 MB. Compressed model + code size: the total submission size, your training script's code bytes plus the model weights compressed via INT8 quantization and zlib. Must be under 16,000,000 bytes (decimal 16MB). The model is decompressed and dequantized before evaluation.

wall time: 300s. Training wall clock time: real-world elapsed time for the training run. The challenge caps training at 10 minutes on 8×H100 GPUs. Our L40S iteration runs use shorter time limits since we're just getting directional signal, not final scores.

steps: 1,501. Training steps completed: each step processes one batch of tokens, computes the loss, and updates the model weights. More steps generally means a better-trained model. The number of steps you get depends on your batch size, GPU speed, and wall clock limit.

step avg: 200ms. Average time per training step: how long each gradient update takes in milliseconds. Faster steps mean more training in the same wall clock budget. Affected by batch size, model size, and GPU capability. The 8×H100 baseline averages 43.5ms; our L40S averages ~230-1000ms depending on batch size.
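The conversion formula and the parameter budgets above are simple arithmetic, worth sanity-checking once. A quick sketch using this hole's numbers; the tokens-per-byte ratio is inferred from the two reported values rather than printed on the card:

```python
import math

# Values from this hole's card.
val_loss = 2.4095   # validation loss, nats per token
val_bpb = 1.4270    # bits per byte

# BPB = (val_loss / ln 2) * (tokens / bytes)
bits_per_token = val_loss / math.log(2)      # ~3.4764 bits per token
tokens_per_byte = val_bpb / bits_per_token   # ~0.4105, i.e. ~2.44 bytes per token (inferred)
print(f"{bits_per_token:.4f} bits/token, {1 / tokens_per_byte:.2f} bytes/token")

# Rough parameter ceilings under the 16,000,000-byte cap, ignoring code bytes
# and zlib gains (so these match the card's ballpark figures):
cap_bits = 16_000_000 * 8
print(cap_bits // 8)         # INT8 (8 bits/param):    16,000,000 params
print(int(cap_bits / 1.58))  # ternary (1.58 bits/param): ~81,000,000 params
```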

Training Curve

[Chart: train_loss vs. step, this hole against the Baseline (R2); final train_loss 2.4006]
Post-Round Lesson

Too much club left in the locker. The speed gain was real, but the capacity loss was worse.

vs. the Field

+0.2073 vs SOTA (1.2197)
+0.2026 vs Baseline (1.2244)
+0.2026 vs Our Best (1.2244)


Model Card

How this hole was run

Run ID round_008_mlp1
Status ok
Backend cuda
Key Overrides
MLP_MULT=1 · TRAIN_BATCH_TOKENS=131072 · MAX_WALLCLOCK_SECONDS=300
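For the curious, a hedged sketch of one common way overrides like these get consumed, assuming they arrive as environment variables (the run's actual plumbing isn't shown on this card):

```python
import os

# Illustrative only: read the three overrides listed above,
# falling back to the baseline-style defaults this series has been using.
MLP_MULT = int(os.environ.get("MLP_MULT", "2"))                           # this hole: 1
TRAIN_BATCH_TOKENS = int(os.environ.get("TRAIN_BATCH_TOKENS", "131072"))
MAX_WALLCLOCK_SECONDS = int(os.environ.get("MAX_WALLCLOCK_SECONDS", "300"))
```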
← Back Up The Fairway: Round 1, Hole 7 · Traveling Light
Head To The Next Tee: Round 1, Hole 9 · The Short Game →