How about on a MacBook Pro M2 Max with 64GB RAM? Any recommendations for local models for coding on that?
I tried to run some of the differently sized DeepSeek R1 models locally when they first came out, but I couldn't get any of them running at the time, and I had to download a lot of data just to try. So if you know a specific DeepSeek R1 size that will work in 64GB of RAM on a MacBook Pro M2 Max, or another great local LLM for coding on that machine, that would be super appreciated.
I imagine either of these in quantized form would fit pretty well and be decent: Qwen R1 32B[1] or Qwen 3 32B[2].
Specifically, the `Q6_K` quant looks solid at ~27GB. That leaves enough headroom on your 64GB MacBook to actually load a decent amount of context. (It takes extra memory for every token of context you need.)
Rough math, based on this calculator[0], is that it's around ~10GB per 32k tokens of context. And that doesn't seem to change with a different quant size -- the context cache is stored separately from the quantized weights -- you just have to have enough headroom.
So with 64GB:
- ~27GB for the Q6 quant
- 10-20GB for 32-64k tokens of context
That leaves you around 20GB for application memory and _probably_ enough context to actually be useful for larger coding tasks! (It just might be slow, but you can use a smaller quant to get more speed.)
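To sanity-check that rough math, here's a small back-of-the-envelope sketch in Python. The layer count, KV-head count, and head dimension are assumptions for a Qwen-32B-class model (check the model's config.json for the real values), so treat the output as an estimate rather than exact figures.

```python
# Back-of-the-envelope memory budget for a quantized ~32B model on a 64GB Mac.
# All architecture numbers below are assumptions for a Qwen-32B-class model.

N_PARAMS        = 32.8e9  # total weights (assumed)
BITS_PER_WEIGHT = 6.56    # effective bits/weight for a Q6_K quant (assumed)

N_LAYERS   = 64           # transformer blocks (assumed)
N_KV_HEADS = 8            # grouped-query attention KV heads (assumed)
HEAD_DIM   = 128          # per-head dimension (assumed)
KV_BYTES   = 2            # fp16 KV cache, regardless of weight quant

def weights_gb() -> float:
    return N_PARAMS * BITS_PER_WEIGHT / 8 / 1e9

def kv_cache_gb(context_tokens: int) -> float:
    # 2x for keys + values, per layer, per KV head, per head dim, per token.
    per_token_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES
    return per_token_bytes * context_tokens / 1e9

if __name__ == "__main__":
    total_ram_gb = 64
    for ctx in (32_768, 65_536):
        used = weights_gb() + kv_cache_gb(ctx)
        print(f"{ctx:>6} tokens: ~{used:.1f}GB used, ~{total_ram_gb - used:.1f}GB left")
```

With grouped-query attention the cache comes out a bit under the calculator's ~10GB per 32k, but it's the same order of magnitude, and it agrees that 64GB leaves real headroom at 32-64k tokens of context.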
I really like Mistral Small 3.1 (I have a 64GB M2 as well). Qwen 3 is worth trying in different sizes too.
I don't know if they'll be good enough for general coding tasks though - I've been spoiled by API access to Claude 3.7 Sonnet and o4-mini and Gemini 2.5 Pro.
How do you determine peak memory usage? Just look at Activity Monitor?
I've yet to find a good overview of how much memory each model needs for different context lengths (other than the back-of-the-envelope #weights * bits). LM Studio warns you if a model will likely not fit, but it's not very exact.
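Activity Monitor works, but if you're running the model in-process (e.g. via mlx-lm or llama-cpp-python) you can also ask the OS for the process's peak resident set size and log it alongside your runs. A minimal sketch of that idea; note that `ru_maxrss` is reported in bytes on macOS but in kilobytes on Linux.

```python
import resource

def peak_rss_gb() -> float:
    """Peak resident set size of this process so far (macOS reports bytes)."""
    maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return maxrss / 1e9  # on Linux this would be maxrss * 1024 / 1e9

# ... load the model and run a generation at your target context length, then:
print(f"peak memory so far: ~{peak_rss_gb():.1f}GB")
```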
There are plenty of smaller (quantized) models that fit well on your machine! On an M4 with 24GB it's already possible to comfortably run 8B quantized models.
Qwen 3 8B on MLX runs in just 5GB of RAM and can write basic code but I don't know if it would be good enough for anything interesting: https://simonwillison.net/2025/May/2/qwen3-8b/
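If you want to try that yourself, here's roughly what it looks like with the `mlx-lm` Python package; the model repo name is an assumption (the mlx-community 4-bit build), so swap in whichever Qwen 3 8B MLX quant you actually download.

```python
# pip install mlx-lm  (Apple Silicon only)
from mlx_lm import load, generate

# Model name is an assumption -- substitute the MLX quant you actually use.
model, tokenizer = load("mlx-community/Qwen3-8B-4bit")

prompt = "Write a Python function that parses an ISO 8601 date string."
print(generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=False))
```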
Honestly though with that little memory I'd stick to running against hosted LLMs - Claude 3.7 Sonnet, Gemini 2.5 Pro, o4-mini are all cheap enough that it's hard to spend much money with them for most coding workflows.