April 2026 TLDR Setup for Ollama and Gemma 4 26B on a Mac mini

318 points - last Friday at 9:35 AM


Aurornis last Friday at 2:44 PM
If this is your first time using open weight models right after release, know that there are always bugs in the early implementations and even quantizations.

Every project races to have support on launch day so they don’t lose users, but the output you get may not be correct. There are already several problems being discovered in tokenizer implementations and quantizations may have problems too if they use imatrix.

So you’re going to see a lot of “I tried it but it sucks because it can’t even do tool calls” and other reports about how the models don’t work at all in the coming weeks from people who don’t realize they were using broken implementations.

If you want to try cutting edge open models you need to be ready to constantly update your inference engine and check your quantization for updates and re-download when it’s changed. The mad rush to support it on launch day means everything gets shipped as soon as it looks like it can produce output tokens, not when it’s tested to be correct.
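In Ollama terms, that maintenance loop is only a couple of commands; a sketch, assuming a Homebrew install and a hypothetical gemma4:26b tag:

```shell
# Update the inference engine itself (assumes Ollama was installed via Homebrew)
brew upgrade ollama

# Re-pull the model; if the quantization was fixed and re-uploaded upstream,
# this fetches the corrected layers instead of reusing the stale local copy
ollama pull gemma4:26b
```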

logicallee last Friday at 12:06 PM
In case someone would like to know what these are like on this hardware: I tested Gemma 4 32b (the ~20 GB model, the largest Gemma model Google published) and Gemma 4 e4b (gemma4:e4b, the ~10 GB model) on this exact setup (Mac mini M4 with 24 GB of RAM, using Ollama) and livestreamed it:

https://www.youtube.com/live/G5OVcKO70ns

The ~10 GB model is super speedy, loading in a few seconds and giving responses almost instantly. If you just want to see its performance, it says hello around the 2 minute mark in the video (and fast!), and the ~20 GB model says hello around 5 minutes 45 seconds in. You can see the substantial difference in their loading times and generation speed. I also had each of them complete a difficult coding task; both got it correct, but the 20 GB model was much slower. It's a bit too slow to use day to day on this setup, plus it would take almost all the memory. The 10 GB model fits comfortably on a 24 GB Mac mini with plenty of RAM left for everything else, and it seems usable for small coding tasks.
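To reproduce a side-by-side like this, something close to the following should work (the model tags are assumptions based on the sizes mentioned above; `--verbose` makes Ollama print timing stats after each response):

```shell
# Pull both sizes (tags are assumptions based on the comment's ~10 GB / ~20 GB figures)
ollama pull gemma4:e4b   # the ~10 GB model
ollama pull gemma4:26b   # the ~20 GB model

# Ask each for a first response; --verbose prints load duration,
# prompt eval rate, and eval rate (tokens/s) after the reply
ollama run gemma4:e4b --verbose "hello"
ollama run gemma4:26b --verbose "hello"
```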

neo_doom last Friday at 4:04 PM
Huge Claude user here… can someone help me set some realistic expectations if I bought a Mac mini and spun one up? I use Claude primarily for dev work and Home Lab projects. Are the open models good enough to run locally and replace the Claude workload? Or am I better off with my $20/mo Claude subscription?
milchek last Friday at 1:06 PM
I tested briefly on a MacBook Pro M4 with 36 GB. Ran it in LM Studio with opencode as the frontend and it failed over and over on tool calls. Switched back to Qwen. Anyone else on a similar setup have better luck?
anonyfox last Friday at 2:07 PM
M5 Air here with 32 GB RAM and 10/10 cores. Has anyone had luck with MLX builds on oMLX so far? Not at my machine right now and would love to know whether these models already work, including tool calling.
jasonriddle last Friday at 6:17 PM
Slightly off topic, but question for folks.

I'm hoping to replace coding with Claude Sonnet 4.5 with an open-source or open-weights model. Are any of the models on Ollama.com's cloud offering (https://ollama.com/search?c=cloud), or any of the models on OpenRouter.ai, a close replacement? I know that no model right now matches the full performance and capabilities of Claude Sonnet 4.5, but I want to know how close I can get and with which model(s).

If there is a model you think can replace it, say how long you have been using it, with what harness (Claude Code, opencode, etc.), and some strengths and weaknesses you have noticed. I'm not interested in what benchmarks say; I want to hear about real-world use from programmers using these models.

spencer-p last Friday at 4:41 PM
Weird that the steps are for "Gemma 4 12b", which does not exist, and then switches to 26b midway through.

There's also a step to verify that it doesn't fit on the GPU with ollama ps showing "14%/86% CPU/GPU". Doesn't this mean you'll have really bad performance?
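Generally yes: a split like that means a chunk of the layers spilled into CPU-side RAM, and every generated token pays for the slower CPU portion. You can check the split on your own machine (the PROCESSOR column shows it; exact formatting may vary by Ollama version):

```shell
# List currently loaded models; the PROCESSOR column shows how the weights
# are split, e.g. "14%/86% CPU/GPU" when some layers didn't fit on the GPU
ollama ps
```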

pwr1 last Friday at 8:17 PM
Running 26B locally is impressive, but the latency math gets rough once you're doing anything beyond chat. We switched from local inference to API calls for image generation specifically because cold start + generation time on consumer hardware made it impractical for any kind of automated workflow.

Local is great for experimentation, but production workloads that need to run reliably at specific times still favor APIs, imo. That said, for privacy-sensitive use cases where data can't leave the machine, setups like this are invaluable.

easygenes last Friday at 10:40 AM
Why is ollama so many people’s go-to? Genuinely curious, I’ve tried it but it feels overly stripped down / dumbed down vs nearly everything else I’ve used.

Lately I’ve been playing with Unsloth Studio and think that’s probably a much better “give it to a beginner” default.

aetherspawn last Friday at 12:28 PM
Which harness (IDE) works with this if any? Can I use it for local coding right now?
Xentyon yesterday at 6:51 AM
Nice setup. Running models locally on Mac hardware has gotten surprisingly viable. I'm using a similar stack in Switzerland for testing AI agent workflows — the M-series chips handle inference well for tool-calling tasks.
boutell last Friday at 11:38 AM
Last night I had to install the v0.20 pre-release of ollama to use this model. So I'm wondering if these instructions are accurate.
redrove last Friday at 10:25 AM
There is virtually no reason to use Ollama over LM Studio or the myriad other alternatives.

Ollama is slower, and they started out as a shameless llama.cpp ripoff without giving credit; now they’ve “ported” it to Go, which means they’re just vibe-code-translating llama.cpp, bugs included.

kristopolous last Friday at 3:27 PM
Are you getting tool call and multimodal working? I don't see it in the quantized unsloth ggufs...
amelius last Friday at 10:32 PM
Has anyone tried to run it on a Jetson Orin AGX with 64GB unified memory?
OkGoDoIt last Friday at 7:07 PM
Sorry for being off topic, but why can’t I open this without being logged into GitHub? I thought gists are either completely private or publicly accessible. Are they no longer publicly accessible?
kilzimir last Friday at 8:08 PM
Kinda crazy that I can run a 26B model on a 1500€ laptop (MacBook Air M5 32GB). Does anyone know how I can actually use this in a productive way?
zachperkel last Friday at 3:33 PM
how many TPS does a build like this achieve on gemma 4 26b?
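One way to measure this yourself, assuming a local Ollama and a hypothetical gemma4:26b tag: the generate API reports eval_count (tokens produced) and eval_duration (nanoseconds), so tokens per second is their ratio:

```shell
# Ask for one non-streamed response and compute eval tokens/s from the
# timing fields Ollama returns (eval_duration is in nanoseconds)
curl -s http://localhost:11434/api/generate \
  -d '{"model": "gemma4:26b", "prompt": "Say hello.", "stream": false}' |
  python3 -c 'import json,sys; r=json.load(sys.stdin); \
print(round(r["eval_count"] / r["eval_duration"] * 1e9, 1), "tok/s")'
```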
renewiltord last Friday at 2:41 PM
Just told Claude to sort it out and it ran it. 26 tok/s on the Mac mini I use for personal claw type program. Unusable for local agent but it’s okay.
robotswantdata last Friday at 10:57 AM
Why are you using Ollama? Just use llama.cpp

brew install llama.cpp

Use the built-in CLI, server, or chat interface, and hook it up to any other app.
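For anyone who hasn't used it directly, a minimal sketch (the GGUF filename is a hypothetical placeholder; llama-cli and llama-server ship with the Homebrew formula):

```shell
brew install llama.cpp

# One-shot prompt in the terminal (model path is a placeholder)
llama-cli -m ./gemma4-26b-Q4_K_M.gguf -p "hello"

# Or run the OpenAI-compatible HTTP server and point any client app at it
llama-server -m ./gemma4-26b-Q4_K_M.gguf --port 8080
```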

mark_l_watson last Friday at 1:58 PM
The article has a few good tips for using Ollama. Perhaps it should note that the Gemma 4 models are not really trained for strong performance with coding agents like OpenCode, Claude Code, pi, etc. The Gemma 4 models are excellent for applications requiring tool use, data extraction to JSON, etc. I asked Gemini Pro about this earlier and it recommended Qwen 3.5 models specifically for coding, backing that up with interesting material on training. This makes sense and matches something I do: use strong models to build effective applications around small, efficient models.
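For the JSON-extraction use case, a minimal sketch against a local Ollama (the model tag is an assumption; the "format": "json" option constrains the reply to valid JSON):

```shell
# Ask the model to extract structured fields; "format": "json" makes
# Ollama constrain the response to valid JSON rather than free text
curl -s http://localhost:11434/api/chat -d '{
  "model": "gemma4:26b",
  "format": "json",
  "stream": false,
  "messages": [{
    "role": "user",
    "content": "Extract {\"name\": ..., \"city\": ...} from: Alice moved to Zurich in 2019."
  }]
}'
```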