Yes. I collected some details here: https://simonwillison.net/2026/Mar/18/llm-in...

anemll · 2026-03-23T18:48:55 1774291735

Thanks for posting this, that's how I first found out about Dan's experiment! SSD speed doubled in the M5 Pro/Max generation, which makes it usable! One paper that I think is under the radar is "KV Prediction for Improved Time to First Token" https://arxiv.org/abs/2410.08391 which hopefully can help with prefill for Flash streaming.

Yukonv · 2026-03-23T19:12:39 1774293159

That’s exactly what I thought about. I'm getting my hands on an M5 Max this week and going to see how Dan’s experiment performs with faster I/O. I'm also going to experiment with running the active parameters at Q6 or Q8: since output is I/O bottlenecked, there should be room for higher-accuracy compute.

anemll · 2026-03-23T19:19:23 1774293563

Check my repo, I added some support for GGUF/Unsloth at Q3/Q5/Q8: https://github.com/Anemll/flash-moe/blob/iOS-App/docs/gguf-h...

3abiton · 2026-03-23T21:18:50 1774300730

To be fair, it's "possible" to run such a setup with llama.cpp with SSD offload. The token generation (TG) speeds are just abysmal. But it's possible.

superjan · 2026-03-23T17:23:28 1774286608

That was a very good summary. One detail the post could use is mentioning that the 4 or 10 experts invoked were selected from the 512 experts the model has per layer (to give an idea of the savings).
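To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It only uses the counts mentioned above (4 or 10 active experts out of 512 per layer); the function name is made up for illustration, and it ignores any shared or dense weights that have to be resident regardless, so treat it as an upper bound on the savings rather than a measurement.

```python
# Illustrative arithmetic only. The expert counts come from the comment above;
# shared/dense weights that are always needed are deliberately ignored.
EXPERTS_PER_LAYER = 512


def active_fraction(active_experts: int, total_experts: int = EXPERTS_PER_LAYER) -> float:
    """Fraction of a layer's routed experts that one token actually touches."""
    if not 0 < active_experts <= total_experts:
        raise ValueError("active_experts must be between 1 and total_experts")
    return active_experts / total_experts


for k in (4, 10):
    frac = active_fraction(k)
    print(
        f"{k} of {EXPERTS_PER_LAYER} experts: {frac:.2%} of the expert weights per layer, "
        f"roughly {1 / frac:.0f}x less than reading all of them"
    )
```

So under these assumptions each token touches about 0.78% (4 experts) or 1.95% (10 experts) of a layer's expert weights, which is roughly 128x or 51x less than streaming every expert.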

trebligdivad · 2026-03-23T22:03:33 1774303413

I guess this is all set up to show off the new high-bandwidth-flash stuff that's due out soon?