That's almost exactly my setup and I'm very happy with its performance.
I noticed recently that I started to prefer my local Qwen3.6 35B A3B and pi agent over Claude Code.
Both fail at different tasks, and Qwen more so than Claude.
But the way Qwen fails is much more straightforward. In writing tasks Qwens hallucinations and bullshitting are much easier to spot because it doesn't have the sleek vocabulary and wordsmithing skills to disguise its ignorance.
In coding tasks that Qwen can't solve it often just goes into a tool calling doom loop that the pi harness can catch, whereas Claude attempts ever more convoluted and creative things just making more and more mess that takes forever to clean up.
I think part of the story is that the tasks for which I use AI are fairly simple and maybe don't need a frontier model. But I wonder if "proper" developers had similar experience?
ydjtoday at 4:26 PM
80tp/s with 5080 3090 combo is wild. I’ve been working with a 4090 and two Tenstorrent p150 cards, and manage only about 30 tps utilizing all three for qwen3.6 27b q8. Guess I got more optimization to do.
Would like to see the perf of their setup with and without mtp and ngram speculative decoding though, as well as parallel decode performance (once llamacpp mtp plays well with multiple slots).
Being in California electricity alone puts this non-competitive with just paying a cloud though.
Would you mind giving these a try and let me know how they work for you? I’d imagine you would get better results and the latter will fit on a single GPU.
I just bought a $25 chinese 2x Oculink card and two Minis Forum DEG1, had some spare PSUs lying around, and just installed two cards on each.
It works.
I saw that there is also a 4x Oculink card, but i don't know it that will work, too.
cybertimtoday at 5:29 PM
I bought two 3080/20gb and one of those MACHINIST X99 mainboards as well (one with two full x16 pcie slots) those boards come with a xeon cpu included (for the pcie lane support) it set me back 800 euros total (had a spare psu, ssd and mem in a drawer) and now im also happily running 80tk/s Qwen 3.6 Q8 (MTP).
staredtoday at 5:20 PM
I really like Qwen 3.6 27B Q8.
On Apple Silicon, with MLX-LM, I am getting 20 tok/s with Macbook Max M5.
Not sure how it compares to llama.cpp performance.
In any case, while it is noticeably slower than this Nvidia RTX setup, being able to run such models on laptop is wild. Though, it heats my laptop rapidly.
dengtoday at 3:34 PM
I can understand the joy of running things yourself, and can also see the privacy aspect. However, I pay ~3$ per 1M/tokens for that model on Openrouter, and it's not even quantized. A refurbished 3090 and a 5080 will set you back well over 2k, not to mention the electricity to run them...
tonyricetoday at 5:44 PM
If I had an eGPU right now, I'd 100% be using Qwen
ComputerGurutoday at 3:07 PM
I would have liked to see a bit more on the theory side of things, explaining optimal weight and inference splits, actual issues with existing drivers, etc instead of what’s essentially just a recipe.
varispeedtoday at 4:29 PM
Could 2x RTX5080 work just as well?
well_ackshuallytoday at 5:23 PM
It does come with one tiny little issue: it now draws 700W on full load. Just a single 5080 is enough to measurably heat up a room when loaded (320W draw at the wall on mine), and with that amount of power flowing through, you better have a good PSU as well as checking your power plugs themselves, these are going to get HOT when your entire setup is basically drawing 1kW.