Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model

880 points - yesterday at 1:19 PM


simonw yesterday at 4:46 PM
The pelican is excellent for a 16.8GB quantized local model: https://simonwillison.net/2026/Apr/22/qwen36-27b/

I ran it on an M5 Pro with 128GB of RAM, but it only needs ~20GB of that. I expect it will run OK on a 32GB machine.

Performance numbers:

  Reading: 20 tokens, 0.4s, 54.32 tokens/s
  Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s
I like it better than the pelican I got from Opus 4.7 the other day: https://simonwillison.net/2026/Apr/16/qwen-beats-opus/
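For anyone double-checking numbers like these, the generation rate can be recomputed from the raw figures quoted above (the small gap versus the quoted 25.57 tok/s presumably comes from sub-second elapsed timing that the rounded "2min 53s" hides):

```python
# Recompute generation throughput from the figures quoted above.
gen_tokens = 4444
elapsed_s = 2 * 60 + 53  # "2min 53s", rounded to whole seconds

print(round(gen_tokens / elapsed_s, 2))  # 25.69 tok/s, close to the quoted 25.57
```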
finnjohnsen2 yesterday at 9:07 PM
Since Gemma 4 came out this Easter, the gap between self-hosted models and Claude has decreased significantly, I think. The gap is still huge; it's just that local models were extremely non-competitive before Easter. So now it seems Qwen 3.6 is another bump up from Gemma 4, which is exciting if true. I keep an Opus close of course, because these local models still wander off in the wrong direction and fail. Something Opus almost never does for me anymore.

But every time a local model gets me by, I feel closer to where I should be: writing code should still be free. Both free as in free beer, and free as in freedom.

My setup is a separate dedicated Ubuntu machine with an RTX 5090. Qwen 3.6:27b is using 29/32 GB of VRAM as it works right this minute. I run Ollama in a rootless Podman container. And I use OpenCode as an ACP service for my editor, which I highly recommend. ACP (Agent Client Protocol) is how the world should be, in case you were asking, which you didn't :)
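A setup like the one described above can be sketched roughly as follows. This is an assumption-laden sketch, not the commenter's actual config: the image tag, volume name, and GPU flag are illustrative, and the CDI-based `--device nvidia.com/gpu=all` syntax requires the NVIDIA Container Toolkit's CDI spec to be generated first, which varies by distro:

```shell
# Rootless Podman + Ollama sketch (image/volume names and GPU flags are
# assumptions; requires an NVIDIA CDI spec, e.g. via nvidia-ctk)
podman run -d --name ollama \
  --device nvidia.com/gpu=all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  docker.io/ollama/ollama

# Then pull and run the model inside the container
podman exec -it ollama ollama run qwen3.6:27b
```

An editor speaking ACP to OpenCode would then point at the Ollama API on port 11434.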

Exciting times, and thank you Qwen team for making the world a better place in a world of Sam Altmans.

pja today at 9:51 AM
I know this is kind of old hat by now, but it kind of blows my mind that I can upload a hand-drawn decision tree and get a transcribed DOT file back on consumer hardware, using a pile of linear algebra that wasn't even particularly specialised for this purpose. It's just a capability the model picked up along with everything else during training.
anonzzzies yesterday at 3:16 PM
I wish that all announcements of models would show what (consumer) hardware you can run this on today, costs and tok/s.
jameson yesterday at 4:13 PM
What competitive advantage do OpenAI/Anthropic have when companies like Qwen/Minimax/etc. are open-sourcing models that show similar (if somewhat lower) benchmark results?

Also, the token prices of these open-source models are a fraction of Anthropic's Opus 4.6 pricing[1]

[1]: https://artificialanalysis.ai/models/#pricing

syntaxing yesterday at 4:45 PM
Been using Qwen 3.6 35B and Gemma 4 26B on my M4 MBP, and while it’s no Opus, it does 95% of what I need which is already crazy since everything runs fully local.
Avlin67 today at 12:22 PM
130 tokens per second on dual RTX 5090s for the FP8 version
sietsietnoac yesterday at 3:43 PM
Generate an SVG of a pelican riding a bicycle: https://codepen.io/chdskndyq11546/pen/yyaWGJx

Generate an SVG of a dragon eating a hotdog while driving a car: https://codepen.io/chdskndyq11546/pen/xbENmgK

Far from perfect, but it really shows how powerful these models can get

zkmon yesterday at 7:48 PM
On llama-server, the Q4_K_M quant gives me about 91k of context on 24GB, which works out to about 70MB of KV cache per 1K tokens of context. I could have gone for Q5, which would probably leave room for about 30K tokens. I think this is pretty impressive.
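The arithmetic behind this kind of context budget can be sketched in a few lines. The figures below are the ones quoted in this thread (a ~16.8GB Q4_K_M model, ~70MB of KV per 1K tokens); the result overshoots the reported 91k because it ignores compute buffers and other runtime overhead:

```python
# Back-of-envelope context budget: VRAM left after weights, divided by
# per-1K-token KV-cache cost. Ignores runtime overhead, which is why the
# real figure lands nearer 91k than the naive estimate.
def max_context_tokens(vram_gb, weights_gb, kv_mb_per_1k):
    free_mb = (vram_gb - weights_gb) * 1024
    return int(free_mb / kv_mb_per_1k) * 1000

# 24 GB card, ~16.8 GB Q4_K_M weights, ~70 MB of KV cache per 1K tokens
print(max_context_tokens(24, 16.8, 70))  # 105000 (naive upper bound)
```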
datadrivenangel yesterday at 9:46 PM
So far I'm unimpressed for local inference. I got 11 tokens per second on omlx on an M5 Pro with 128 GB of RAM, so it took an hour to write a few hundred lines of code that didn't work. Opus and Sonnet in Claude Code completed the same task successfully in a matter of minutes. The 3.6:35b model seemed okay on Ollama yesterday.

Need to check out other harnesses for this besides Claude Code, but the local models are just painfully slow.

yamajun93 today at 10:47 AM
The benchmarks look great for a 27B model, but I'm curious how it will perform locally. I have tried a bunch of open-source models, and I still feel we are far from getting output similar to Claude Code on the M3 in my MacBook Pro.
mark_l_watson yesterday at 5:31 PM
I have been running the slightly larger 35B model for local coding:

ollama run qwen3.6:35b-a3b-nvfp4

This has been optimized for Apple Silicon and runs well on a 32 GB RAM system. Local models are getting better!

vladgur yesterday at 3:17 PM
This is getting very close to fitting on a single 3090 with 24GB VRAM :)
vibe42 yesterday at 4:04 PM
Q4-Q5 quants of this model run well on gaming laptops with 24GB VRAM and 64GB RAM. You can get one of those for around $3,500.

Interesting pros/cons vs the new MacBook Pros depending on your preferences.

And Linux runs better than ever on such machines.

originalvichy yesterday at 3:19 PM
Good news!

Friendly reminder: wait a couple of weeks to judge the "final" quality of these free models. Many of them suffer from hidden bugs when connected to an inference backend, or from bad configs that slow them down. The dev community usually takes a week or two to find the most glaring issues. Some of them may require patches to tools like llama.cpp, and some require users to avoid specific default options.

Gemma 4 had some issues that were ironed out within a week or two. This model is likely no different. Take initial impressions with a grain of salt.

2001zhaozhao yesterday at 6:22 PM
I'm kind of interested in a setup where one buys local hardware specifically to run a crap ton of small-to-medium LLM locally 24/7 at high throughput. These models might now be smart enough to make all kinds of autonomous agent workflows viable at a cheap price, with a good queue prioritization system for queries to fully utilize the hardware.
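The queue-prioritization idea above can be sketched with a plain heap: interactive queries jump ahead of bulk agent jobs so the hardware stays saturated without starving users. Everything here (class and method names, priority values) is illustrative, not any existing library's API:

```python
import heapq
import itertools

# Minimal priority queue for scheduling LLM jobs on a saturated local box.
# Lower priority number = served sooner; the counter breaks ties FIFO.
class JobQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, prompt, priority=10):
        heapq.heappush(self._heap, (priority, next(self._counter), prompt))

    def next_job(self):
        if not self._heap:
            return None
        _, _, prompt = heapq.heappop(self._heap)
        return prompt

q = JobQueue()
q.submit("summarize last night's logs", priority=50)  # background batch work
q.submit("interactive query", priority=1)             # a human is waiting
print(q.next_job())  # interactive query
```

A real system would add per-job token budgets and batching, but the scheduling core is this small.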
n8henrie yesterday at 11:47 PM
I'm still fairly new to local LLMs; I spent some time setting up and testing a few Qwen3.6-35B-A3B models yesterday (MLX 4-bit and 8-bit, GGUF Q4_K_M and Q4_K_XL, I think).

Was impressed at how well they ran on my 64GB M4.

It looks like this new model is slightly "smarter" (based on the tables in TFA) but requires more VRAM. Is that it? The "dense" part being the big deal?

As 27B < 35B, should we expect some quantized models soon that will bring the VRAM requirement down?

lgessler yesterday at 7:29 PM
I'll be really interested to hear qualitative reports of how this model works out in practice. I just can't believe that a model this small is actually as good as Opus, which is rumored to be about two orders of magnitude larger.
navbaker yesterday at 7:51 PM
TIL that our corporate network site blocker classifies qwen.ai as a sex site…
docheinestages yesterday at 7:15 PM
Has anyone tried using this with Claude Code or Qwen Code? They both require very large context windows (32k and 16k respectively), which on a Mac M4 48GB serving the model via LM Studio is painfully slow.
mft_ yesterday at 9:31 PM
Huh, I'm running the Q4_K_M quant with LM Studio, and asked it "How can I set up Qwen 3.6 27b to use tools and access the local file system?".

Part of its reply was: Quick clarification: As of early 2025, "Qwen 3.6" hasn't been released yet. You are likely looking for Qwen2.5, specifically the Qwen2.5-32B-Instruct model, which is the 30B-class model closest to your 27B reference. The instructions below will use this model.

Weird.

amunozo yesterday at 2:15 PM
A bit skeptical about a 27B model being comparable to Opus...
storus yesterday at 9:36 PM
If this runs at Opus 4.5 level for agentic coding then I don't really need any cloud models anymore.
htrp yesterday at 7:34 PM
Any comparisons against Qwen3.6-35B-A3B?
UncleOxidant yesterday at 3:45 PM
I've been waiting for this one. I've been using 3.5-27b with pretty good success for coding in C, C++, and Verilog. It's definitely helped in light of the reduced Claude availability on the Pro plan now. If their benchmarks are right, the improvement over 3.5 should mean I'll be using Claude even less.
richstokes yesterday at 8:15 PM
Are there benchmarks of this, and what's the best way to compare it against paid models? With all the rate limiting in Claude/Copilot/etc., running locally is more and more appealing.
pama yesterday at 3:07 PM
Has anyone tested it at home yet and wants to share early impressions?
butz yesterday at 5:23 PM
Are there any "optimized" models with lower hardware requirements that are specialised in a single programming language, e.g. C#?
RandyOrion today at 3:13 AM
Thank you, Qwen team. Small DENSE LLMs shape the future for local LLM users.

When Qwen 3.5 27b was released, I didn't really understand why linear attention was used instead of full attention, given the performance degradation and the problems introduced by the extra (linear) operators. After doing some tests, I found that with llama.cpp and the IQ4_XS quant, the model plus a BF16 cache for the whole 262k context just fits in 32GB of VRAM, which is impossible with full attention. In contrast, with the Gemma 4 31b IQ4_XS quant I have to use a Q8_0 cache to fit 262k of context in VRAM, which is a little annoying (no offense; thank you Gemma team, too).
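The full-attention limitation mentioned above follows from the standard KV-cache size formula. The hyperparameters below are illustrative placeholders for a 27B-class model, not Qwen's or Gemma's actual configs, but they show why a BF16 cache at 262k tokens blows well past a 32GB card under full attention:

```python
# Generic full-attention KV-cache size: one K and one V tensor per layer,
# kv_heads * head_dim values each, per token. bytes_per_val=2 is BF16.
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bytes_per_val=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_val * tokens / 1024**3

# Placeholder config: 48 layers, 8 KV heads (GQA), head_dim 128, BF16
print(kv_cache_gib(262_144, layers=48, kv_heads=8, head_dim=128))  # 48.0 GiB
```

Linear-attention layers sidestep this because their state is constant-size rather than growing with the token count.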

Judging from benchmarks, the 3.5->3.6 upgrade is mostly about agentic capabilities. I hope future upgrades fix some problems I've found, e.g., output repetitiveness in long conversations and breadth of knowledge.

xrd yesterday at 6:44 PM
I'm experimenting with this on my RTX 3090 and opencode. It is pretty impressive so far.
reddit_clone today at 12:12 AM
What can I run on an M4 Pro with 48 GB of RAM?
thot_experiment yesterday at 8:58 PM
no FIM (fill-in-the-middle) support though :(, imo the most slept-on use case for local models
jedisct1 yesterday at 6:11 PM
I really like local models for code reviews / security audits.

Even if they don't run super fast, I can let them work overnight and get comprehensive reports in the morning.

I used Qwen3.6-27B on an M5 (oq8, using omlx) and Swival (https://swival.dev) /audit command on small code bases I use for benchmarking models for security audits.

It found 8 out of 10, which is excellent for a local model, produced valid patches, and didn't report any false positives, which is even better.

vocoda yesterday at 8:52 PM
I wonder why they did not compare it to Qwen Coder Next?
Mr_Eri_Atlov yesterday at 3:54 PM
Excited to try this; the Qwen 3.6 MoE they released a week or so back had a noticeable performance bump over 3.5 in a rather short period of time.

For anyone invested in running LLMs at home or on a much more modest budget rig for corporate purposes, Gemma 4 and Qwen 3.6 are some of the most promising models available.

blurbleblurble yesterday at 7:15 PM
It's a rap on claude
LowLevelKernel yesterday at 5:50 PM
How much VRAM is needed?
objektif yesterday at 7:21 PM
Does anyone know a good provider for a low-latency LLM API? We looked at Cerebras and Groq but they have zero capacity right now. GPT models are too slow for us at the moment. Gemini models are better, but not really at the same level as GPT.