Compression score: 1.4270 bpb

What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
A much smaller MLP might lose some per-step quality, but the extra throughput could still win under a short wallclock.
Looper’s Pick
The MLP eats 55% of all parameters at MLP_MULT=2. That’s over 9 million params just for the feed-forward layers. What if we halve it? MLP_MULT=1 cuts the MLP to 512→512 instead of 512→1024. Saves 4.7 million parameters and makes each step faster. The trade: less “thinking” capacity per layer. But we get 1500 steps instead of 1300. Let’s see if volume beats capacity.
The Shot — Halving the MLP
What does the MLP do, and what happens when you cut it in half?
In golf, your driver gets you distance and your irons get you accuracy. The attention mechanism is the driver — it scans the whole context and gathers relevant information. The MLP (feed-forward network) is the iron — it takes that gathered information and processes it, extracting meaning and storing knowledge.
Each transformer block has both: attention first, then MLP. The MLP works by expanding the representation to a wider hidden dimension (controlled by MLP_MULT), applying a nonlinearity (ReLU² in our case), then projecting back down. At MLP_MULT=2, a 512-dimensional input expands to 1024 dimensions internally. This expansion is where the model stores factual associations and performs its “thinking.”
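To make the shape of the block concrete, here's a minimal PyTorch sketch of the feed-forward layer described above. The module and argument names (`MLP`, `d_model`, `mlp_mult`) are illustrative rather than the run's actual code, and the bias-free linear layers are an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Feed-forward block: expand by mlp_mult, apply ReLU^2, project back down."""
    def __init__(self, d_model: int = 512, mlp_mult: int = 2):
        super().__init__()
        # At mlp_mult=2 this is 512 -> 1024; at mlp_mult=1 it stays 512 -> 512.
        self.up = nn.Linear(d_model, mlp_mult * d_model, bias=False)
        self.down = nn.Linear(mlp_mult * d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)).square())  # ReLU^2 nonlinearity
```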
The MLP accounts for 55% of all parameters in the baseline — over 9 million of 17 million total. By setting MLP_MULT=1 (no expansion), we slash this to ~4.7 million, saving 28% of total parameters. The model drops from 17M to 12.3M params, and each step gets faster (200ms vs 229ms), giving us more steps in the same wall time.
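The bookkeeping checks out against the reported totals. A quick back-of-the-envelope, assuming bias-free up/down projections and a depth of 9 transformer layers (the depth is inferred from the 4,718,592-parameter savings, not stated in the post):

```python
D_MODEL, N_LAYERS = 512, 9  # depth inferred from the 4,718,592-param savings; not stated in the post

def mlp_params(mult: int, d: int = D_MODEL, layers: int = N_LAYERS) -> int:
    # Per layer: up-projection d -> mult*d plus down-projection mult*d -> d, no biases.
    return layers * 2 * d * (mult * d)

print(f"{mlp_params(2):,}")                  # 9,437,184  -> the 'over 9 million' at MLP_MULT=2
print(f"{mlp_params(2) - mlp_params(1):,}")  # 4,718,592  -> matches 17,059,912 - 12,341,320
print(f"{(mlp_params(2) - mlp_params(1)) / 17_059_912:.1%}")  # 27.7% -> the ~28% saving
```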
The risk is real, though. Research consistently shows that MLP capacity is where transformers store knowledge. Cutting it in half is like replacing your 7-iron with a pitching wedge — you can still hit the ball, but you’ve lost range. The question is whether the extra 200 steps from the speed gain compensate for the reduced per-step quality.
On the Tee
(Whispering) A rather aggressive club selection from the caddy today. The MLP — the model’s primary knowledge store — has been halved. Twelve million parameters where there were seventeen. One notes the gallery exchanging concerned glances. But the caddy points to the clock: more steps, more swings, more chances.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4270 |
| val_loss | 2.4095 |
| params | 12,341,320 |
| artifact | 10.25 MB (under the 16 MB cap) |
| wall time | 300s |
| steps completed | 1,501 |
| step avg | 200ms |
| peak memory | 2,474 MiB |
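The val_loss and val_bpb rows are two views of the same quantity: cross-entropy in nats per token versus bits per byte. Converting between them needs the tokenizer's bytes-per-token ratio, which isn't reported; the sketch below backs it out from the two numbers, so treat the ~2.44 figure as inferred rather than official.

```python
import math

val_loss = 2.4095  # reported: cross-entropy, nats per token
val_bpb  = 1.4270  # reported: bits per byte

bits_per_token = val_loss / math.log(2)     # ~3.476 bits per token
bytes_per_token = bits_per_token / val_bpb  # ~2.44 bytes per token (inferred, not reported)
print(bits_per_token, bytes_per_token)
```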
vs Hole 5 (MLP_MULT=2)
| Metric | Hole 5 (MLP=2) | Hole 8 (MLP=1) |
|---|---|---|
| val_bpb | 1.4139 | 1.4270 |
| params | 17,059,912 | 12,341,320 (-28%) |
| steps | 1,309 | 1,501 (+15%) |
| step avg | 229ms | 200ms (-13%) |
| artifact | 13.35 MB | 10.25 MB |
Faster and smaller, but 0.013 BPB worse. The capacity loss outweighed the step gain.
The Booth Reacts
Trent: One-point-four-two-seven-zero. (Measured pause) That is… not an improvement. Thirteen thousandths worse than the baseline configuration. One observes that the model is now significantly smaller — ten megabytes, with ample room beneath the sixteen-megabyte ceiling — and faster per step. But the MLP, it would seem, earns its keep. Fifty-five percent of the parameters, and apparently, fifty-five percent of the model’s ability to make sense of language. A lesson in the value of iron play.
Slice: I TOLD you, you can’t just rip the engine out of a car and expect it to go faster because it’s lighter! OK, I didn’t actually tell you that, but I SHOULD have. Look — 1,501 steps and it STILL can’t match 1,309 steps with the full MLP. The MLP is doing WORK. Those parameters aren’t freeloaders like the KV heads. They’re the BACKBONE. (Pointing at camera) Respect the feed-forward network, people. It’s not glamorous, but it’s where the magic happens.
The Card
Dropped a shot versus the last hole
This hole lost 0.0130 on the compression score versus the previous stop; lower is better here, since the score measures how efficiently the model predicts unseen text. The smaller model does leave 5,753,359 bytes of artifact headroom under the cap.
Training Curve
Too much club left in the locker. The speed gain was real, but the capacity loss was worse.
vs. the Field
[Field comparison chart: the leaders sit at 1.2197 and 1.2244 (twice); this hole's 1.4270 trails well behind.]
Model Card
How this hole was run
| run | status | device |
|---|---|---|
| round_008_mlp1 | ok | cuda |