Compression score: 1.4270 bpb

What this score means
Quick read before we head down the fairway.
Bits per byte is the challenge score: how many bits the model needs, on average, to predict each byte of unseen text. Lower is better.
A much smaller MLP might lose some per-step quality, but the extra throughput could still win under a short wallclock.
Looper’s Pick
The MLP eats 55% of all parameters at MLP_MULT=2. That’s over 9 million params just for the feed-forward layers. What if we halve it? MLP_MULT=1 cuts the MLP to 512→512 instead of 512→1024. Saves 4.7 million parameters and makes each step faster. The trade: less “thinking” capacity per layer. But we get 1500 steps instead of 1300. Let’s see if volume beats capacity.
The Shot — Halving the MLP
What does the MLP do, and what happens when you cut it in half?
In golf, your driver gets you distance and your irons get you accuracy. The attention mechanism is the driver — it scans the whole context and gathers relevant information. The MLP (feed-forward network) is the iron — it takes that gathered information and processes it, extracting meaning and storing knowledge.
Each transformer block has both: attention first, then MLP. The MLP works by expanding the representation to a wider hidden dimension (controlled by MLP_MULT), applying a nonlinearity (ReLU² in our case), then projecting back down. At MLP_MULT=2, a 512-dimensional input expands to 1024 dimensions internally. This expansion is where the model stores factual associations and performs its “thinking.”
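To make the shape of the block concrete, here's a minimal PyTorch sketch of the feed-forward layer described above. The module and argument names (`MLP`, `d_model`, `mlp_mult`) are illustrative rather than the run's actual code, and the bias-free linear layers are an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Feed-forward block: expand by mlp_mult, apply ReLU^2, project back down."""
    def __init__(self, d_model: int = 512, mlp_mult: int = 2):
        super().__init__()
        # At mlp_mult=2 this is 512 -> 1024; at mlp_mult=1 it stays 512 -> 512.
        self.up = nn.Linear(d_model, mlp_mult * d_model, bias=False)
        self.down = nn.Linear(mlp_mult * d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.relu(self.up(x)).square())  # ReLU^2 nonlinearity
```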
The MLP accounts for 55% of all parameters in the baseline — over 9 million of 17 million total. By setting MLP_MULT=1 (no expansion), we slash this to ~4.7 million, saving 28% of total parameters. The model drops from 17M to 12.3M params, and each step gets faster (200ms vs 229ms), giving us more steps in the same wall time.
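The bookkeeping checks out against the reported totals. A quick back-of-the-envelope, assuming bias-free up/down projections and a depth of 9 transformer layers (the depth is inferred from the 4,718,592-parameter savings, not stated in the post):

```python
D_MODEL, N_LAYERS = 512, 9  # depth inferred from the 4,718,592-param savings; not stated in the post

def mlp_params(mult: int, d: int = D_MODEL, layers: int = N_LAYERS) -> int:
    # Per layer: up-projection d -> mult*d plus down-projection mult*d -> d, no biases.
    return layers * 2 * d * (mult * d)

print(f"{mlp_params(2):,}")                  # 9,437,184  -> the 'over 9 million' at MLP_MULT=2
print(f"{mlp_params(2) - mlp_params(1):,}")  # 4,718,592  -> matches 17,059,912 - 12,341,320
print(f"{(mlp_params(2) - mlp_params(1)) / 17_059_912:.1%}")  # 27.7% -> the ~28% saving
```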
The risk is real, though. Research consistently shows that MLP capacity is where transformers store knowledge. Cutting it in half is like replacing your 7-iron with a pitching wedge — you can still hit the ball, but you’ve lost range. The question is whether the extra 200 steps from the speed gain compensate for the reduced per-step quality.
On the Tee
(Whispering) A rather aggressive club selection from the caddy today. The MLP — the model’s primary knowledge store — has been halved. Twelve million parameters where there were seventeen. One notes the gallery exchanging concerned glances. But the caddy points to the clock: more steps, more swings, more chances.
Results
| Metric | Value |
|---|---|
| val_bpb | 1.4270 |
| val_loss | 2.4095 |
| params | 12,341,320 |
| artifact | 10.25 MB (under the 16 MB cap) |
| wall time | 300s |
| steps completed | 1,501 |
| step avg | 200ms |
| peak memory | 2,474 MiB |
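The val_loss and val_bpb rows are two views of the same quantity: cross-entropy in nats per token versus bits per byte. Converting between them needs the tokenizer's bytes-per-token ratio, which isn't reported; the sketch below backs it out from the two numbers, so treat the ~2.44 figure as inferred rather than official.

```python
import math

val_loss = 2.4095  # reported: cross-entropy, nats per token
val_bpb  = 1.4270  # reported: bits per byte

bits_per_token = val_loss / math.log(2)     # ~3.476 bits per token
bytes_per_token = bits_per_token / val_bpb  # ~2.44 bytes per token (inferred, not reported)
print(bits_per_token, bytes_per_token)
```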
vs Hole 5 (MLP_MULT=2)
| Metric | Hole 5 (MLP=2) | Hole 8 (MLP=1) |
|---|---|---|
| val_bpb | 1.4139 | 1.4270 |
| params | 17,059,912 | 12,341,320 (-28%) |
| steps | 1,309 | 1,501 (+15%) |
| step avg | 229ms | 200ms (-13%) |
| artifact | 13.35 MB | 10.25 MB |
Faster and smaller, but 0.013 BPB worse. The capacity loss outweighed the step gain.
The Booth Reacts
Trent: One-point-four-two-seven-zero. (Measured pause) That is… not an improvement. Thirteen thousandths worse than the baseline configuration. One observes that the model is now significantly smaller — ten megabytes, with ample room beneath the sixteen-megabyte ceiling — and faster per step. But the MLP, it would seem, earns its keep. Fifty-five percent of the parameters, and apparently, fifty-five percent of the model’s ability to make sense of language. A lesson in the value of iron play.
Slice: I TOLD you, you can’t just rip the engine out of a car and expect it to go faster because it’s lighter! OK, I didn’t actually tell you that, but I SHOULD have. Look — 1,501 steps and it STILL can’t match 1,309 steps with the full MLP. The MLP is doing WORK. Those parameters aren’t freeloaders like the KV heads. They’re the BACKBONE. (Pointing at camera) Respect the feed-forward network, people. It’s not glamorous, but it’s where the magic happens.
The Card
Dropped a shot versus the last hole
This hole lost 0.0130 on the compression score versus the previous stop; lower is better here, since the score measures how efficiently the model predicts unseen text. The smaller model does leave 5,753,359 bytes of artifact headroom under the cap.
Training Curve
Too much club left in the locker. The speed gain was real, but the capacity loss was worse.
vs. the Field
[Field comparison chart: the leaders sit at 1.2197 and 1.2244 (twice); this hole's 1.4270 trails well behind.]
Model Card
How this hole was run
| run | status | device |
|---|---|---|
| round_008_mlp1 | ok | cuda |