How about on a MacBook Pro M2 Max with 64GB RAM? Any recommendations for local models for coding on that?
I tried to run some of the differently sized DeepSeek R1 models locally when they first came out, but I couldn't get any of them running at the time, and I had to download a lot of data just to try. So if you know a specific DeepSeek R1 size that will work in 64GB of RAM on a MacBook Pro M2 Max, or another great local LLM for coding on that machine, that would be super appreciated.
I imagine either of these in quantized form would fit pretty well and be decent: Qwen R1 32B[1] or Qwen 3 32B[2].
Specifically, the `Q6_K` quant looks solid at ~27GB. That leaves enough headroom on your 64GB MacBook to actually load a decent amount of context. (It takes extra memory for every token of context you need.)
Rough math, based on this calculator[0], is that it's around ~10GB per 32k tokens of context. And that doesn't seem to change with a different quant size -- the context cache is stored separately from the quantized weights -- you just have to have enough headroom.
So with 64GB:
- ~27GB for the Q6 quant
- 10-20GB for 32-64k tokens of context
That leaves you around 20GB for application memory and _probably_ enough context to actually be useful for larger coding tasks! (It just might be slow, but you can use a smaller quant to get more speed.)
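To sanity-check that rough math, here's a small back-of-the-envelope sketch in Python. The layer count, KV-head count, and head dimension are assumptions for a Qwen-32B-class model (check the model's config.json for the real values), so treat the output as an estimate rather than exact figures.

```python
# Back-of-the-envelope memory budget for a quantized ~32B model on a 64GB Mac.
# All architecture numbers below are assumptions for a Qwen-32B-class model.

N_PARAMS        = 32.8e9  # total weights (assumed)
BITS_PER_WEIGHT = 6.56    # effective bits/weight for a Q6_K quant (assumed)

N_LAYERS   = 64           # transformer blocks (assumed)
N_KV_HEADS = 8            # grouped-query attention KV heads (assumed)
HEAD_DIM   = 128          # per-head dimension (assumed)
KV_BYTES   = 2            # fp16 KV cache, regardless of weight quant

def weights_gb() -> float:
    return N_PARAMS * BITS_PER_WEIGHT / 8 / 1e9

def kv_cache_gb(context_tokens: int) -> float:
    # 2x for keys + values, per layer, per KV head, per head dim, per token.
    per_token_bytes = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES
    return per_token_bytes * context_tokens / 1e9

if __name__ == "__main__":
    total_ram_gb = 64
    for ctx in (32_768, 65_536):
        used = weights_gb() + kv_cache_gb(ctx)
        print(f"{ctx:>6} tokens: ~{used:.1f}GB used, ~{total_ram_gb - used:.1f}GB left")
```

With grouped-query attention the cache comes out a bit under the calculator's ~10GB per 32k, but it's the same order of magnitude, and it agrees that 64GB leaves real headroom at 32-64k tokens of context.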
I really like Mistral Small 3.1 (I have a 64GB M2 as well). Qwen 3 is worth trying in different sizes too.
I don't know if they'll be good enough for general coding tasks though - I've been spoiled by API access to Claude 3.7 Sonnet and o4-mini and Gemini 2.5 Pro.
How do you determine peak memory usage? Just look at Activity Monitor?
I've yet to find a good overview of how much memory each model needs for different context lengths (other than the back-of-the-envelope #weights * bits). LM Studio warns you if a model will likely not fit, but it's not very exact.
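Activity Monitor works, but if you're running the model in-process (e.g. via mlx-lm or llama-cpp-python) you can also ask the OS for the process's peak resident set size and log it alongside your runs. A minimal sketch of that idea; note that `ru_maxrss` is reported in bytes on macOS but in kilobytes on Linux.

```python
import resource

def peak_rss_gb() -> float:
    """Peak resident set size of this process so far (macOS reports bytes)."""
    maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return maxrss / 1e9  # on Linux this would be maxrss * 1024 / 1e9

# ... load the model and run a generation at your target context length, then:
print(f"peak memory so far: ~{peak_rss_gb():.1f}GB")
```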
There are plenty of smaller (quantized) models that fit well on your machine! On an M4 with 24GB it's already possible to comfortably run 8B quantized models.
Qwen 3 8B on MLX runs in just 5GB of RAM and can write basic code but I don't know if it would be good enough for anything interesting: https://simonwillison.net/2025/May/2/qwen3-8b/
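If you want to try that yourself, here's roughly what it looks like with the `mlx-lm` Python package; the model repo name is an assumption (the mlx-community 4-bit build), so swap in whichever Qwen 3 8B MLX quant you actually download.

```python
# pip install mlx-lm  (Apple Silicon only)
from mlx_lm import load, generate

# Model name is an assumption -- substitute the MLX quant you actually use.
model, tokenizer = load("mlx-community/Qwen3-8B-4bit")

prompt = "Write a Python function that parses an ISO 8601 date string."
print(generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=False))
```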
Honestly though with that little memory I'd stick to running against hosted LLMs - Claude 3.7 Sonnet, Gemini 2.5 Pro, o4-mini are all cheap enough that it's hard to spend much money with them for most coding workflows.