Openrouter Fusion API

177 points - today at 7:10 AM

Comments

all2 today at 6:04 PM

I had a prompt I used for this just using Claude Code:

    Let's review <filepath or specific file> for architectural issues. Spawn 10 agents, create personas for them, have them review the _api.go and write their review to reviews/<persona>-review.md, then have each agent do a round robin response to 3 of the reviews of their choosing (based on the abstract at the beginning of each review) and write the response to response/<original file name>-<agent persona name>-response.md. Then we do rebuttals to the responses in rebuttals/<response file name>-rebuttal.md. Finally, each agent should launch agents to review the reviews, responses, and rebuttals to their review, and compile findings to findings/<original file name>-findings.md. Finally, have another agent compile the findings and write that to review-findings.md. Present a concise version of the findings here.

This works well with frontier models and even locally hosted models (last I used it was with Qwen 3.5).

dsl today at 11:23 AM

Heh. I built "Fusion" a few months ago as an MCP using OpenRouter. The idea was to give Claude a "panel of experts" to go talk to when it got stuck.

After extensive testing and benchmarking I discovered that when you ask one model to judge another's response you don't actually get a better answer. You are just asking it "how closely does this resemble the answer you would have given me." Additional rounds and all the "obvious" solutions that pop into your mind reading the proceeding sentence are essentially just cranking up the temperature.

I did find a solution, but it is insanely expensive. Maybe if this gains traction I'll release mine.

michaelbuckbee today at 10:51 AM

I ran a quick eval to see what this looks like qualitatively vs just calling Opus 4.7 or GPT 5.5 directly.

As expected, Fusion was 7x slower and 4x the cost.

This isn't a knock against it, just that it I think this places Fusion into a "use it only when you need it" category.

https://3fpi5avcqq.evvl.io/

monkeydust today at 6:29 PM

I have been experimenting with multi-agent llms for last month, as I put in the writeup for my repo and in the video the biggest value I have found is when you run a bunch of different agentic strategies in parallel then have a judge review the variance of them. So far that has uncovered interesting insights. The rest of it is so-so. Been fun but also expensive!

Repo with video: https://github.com/monkeydust/rightmind

cj today at 6:45 PM

Are there any good web apps for asking a single question to multiple LLMs? I frequently find myself switching between LLMs to compare results.

A unified UI would be great, although not obvious how useful the "fusion" value prop is.

alex7o today at 2:05 PM

I have been thinking a lot about this and my simplified understanding is that each model can be seen as a bell curve over human knowledge and each model has a different distribution. Using multiple models would allow us to change the distribution of other models with text that is out of their original curve. But then if you think about it does SFP and RL even alter the original distribution of text enough that models have enough variety so that their combined output is something better or just an echo chamber I believe not but I have no way to prove it yet.

andai today at 10:14 AM

Context:

Surpassing Frontier Performance with Fusion

https://news.ycombinator.com/item?id=48525392

And a slightly better UI here: https://openrouter.ai/fusion

On OpenRouter's fusion API your request is routed to several models simultaneously and a judge model combines their answers into a final response. This significantly boosts performance, at the cost of time (at least on the one benchmark they tested, a deep research benchmark).

They have a Budget preset consisting of 3 cheaper models (which roughly matches Fable on that benchmark, costing half as much), and a Quality preset of 3 expensive ones (which beats Fable, but costs twice as much as Fable).

Pareto graph: https://openrouter.ai/blog/images/blog/fusion-benchmark-cost...

Curiously, fusing a model with itself also boosted performance (2xOpus4.8 roughly matching Fable on the benchmark, but costing twice as much as Fable). There's a further, smaller gain from mixing different models. The main gain seems to be from additional test time compute.

Would love to see more research on this, especially focusing on the cheap models that came out recently (e.g. Fusing DSV4 with itself, or with Mimo), and to see what the tradeoffs look like between running a fusion (parallel test time compute) vs increased reasoning or turns.

arizen today at 11:23 AM

Some anecdata on Fusion: I run same query I used for Fable on OR Fusion and results were worse.

It felt, like Fable was able to kinda grasp very deep knowledge/intelligence layers and outline solution not only in agreeable way, but rather it proposed to prioritize solution items, with discarding some of the items, which made a lot of sense to me.

While Fusion felt more like a bit diversified answer of the same class of pre-Fable SOTA models, without touching the depth of knowledge/intelligence layers, which Fable was able to get, in my very limited tests I did, while Fable was accessible.

SteveMorin today at 1:31 PM

Spent the weekend inspired by the new openrouter fusion model and wanted to see if it could run in Claude Code and if I could make it very easy for everyone else to try.

Built - claude-fusion-launcher — run Claude Code on a panel of models, not just one

Also shows cost

https://github.com/smorinlabs/claude-fusion-launcher

ElFitz today at 3:43 PM

I’ve been experimenting with two things on this:

- multi-model consensus, with multiple cross-review rounds. Obviously, the number of inference tasks explodes with the number of models. Led to some interesting results [^0].

- giving an agent "stray thoughts" produced by the same model, or another, giving the second model a selection of the agent’s context, with different triggers (random, loop detection,…)[^1]. So far has proven very helpful and much cheaper than the first.

[0]: https://github.com/lightless-labs/refinery

[1]: https://github.com/Lightless-Labs/skunkworks/tree/main/flux

rektlessness today at 11:26 AM

I tried OpenRouter Fusion with the budget model option but swapped out DeepSeek v3.2 for DeepSeek V4 Pro. The results weren't that bad. An interesting take on quorums for sure. However I did notice a tool call to Claude Opus 4.8 for 1168 - 237 tokens, and $0.0118 cost, which I cannot account for because Opus was not in my selection and only revealed in logs. Strange.

genxy today at 1:04 PM

It should be called something else, maybe Ensemble? It doesn't fuse anything.

bsenftner today at 11:55 AM

I'm sure many have made something like this, I've done a few. I've found simply submitting one's prompt to multiple models to be kind of pointless. You're just going to get statistical noise from the variances in their training methods, as they are all training on pretty much the same data.

I get significantly better results by pre-prompting each LLM (they can be the same LLM too, just another instance), I pre-prompt them to approach from a different perspective. Basically, I create expert personas that each believe they are someone of a different career, different intellectual perspectives, and then that generates a real debate between experts.

ljlolel today at 12:37 PM

Similar feature launched open-source and end-to-end encrypted on my TrustedRouter https://trustedrouter.com/

DavidCanHelp today at 4:50 PM

I built ChatDelta.com to give developers this power hands-on instead of outsourcing it to a company.

robertclaus today at 3:31 PM

Conceptually this is wrapping an agent harness in an LLM call API. I wonder if this format is more digestible than the agent building tools the big labs are rolling out.

eknkc today at 11:06 AM

I opened the page and prompted it `Which 3d printer is the best`. I mean this is a stupid question but I was looking at some 3d printers so it popped into my mind.

Seeing this log is interesting: https://link.ekin.dev/6RzYGGX7

It came up with a decent response but I guess Opus or GPT 5.5 would do fine anyway. Gotta try it on different stuff. But this feels like it would work great on some situations.

bushido today at 11:12 AM

Interestingly I've had a similar experience with agent teams/swarms, albeit they can get much more expensive depending on the workflow.

I found that Fable didn't have as much of an impact when put in a team.

But it was/is a very pleasant model to work with 1:1. And was the first time I didn't use my primary team based workhorse in months, across 10s of sessions last week.

_pdp_ today at 11:52 AM

You could easily distribute the same task to 5 subagents that are specifically programmed to do as best as they can based on their scope and merge the results into a single coherent response.

That is more or less the same thing.

I am not sure who is the intended user of this fusion api as with all things prompt + model matter.

Havoc today at 10:37 AM

Interesting. Will definitely use this.

One scenario I can see it working is writing markdown specs before the coding starts and analysing it for gaps. That’s so few tokens that throwing as much LLM against it as possible is worthwhile regardless of cost per million tks

mischa_u today at 1:41 PM

Haven't managed to get past "Fusion failed. You can retry from the results view." with no clue why it failed...

kloud today at 2:58 PM

Would be interesting to see coding performance on SWE benchmarks.

egeres today at 10:42 AM

I wonder if these fusion techniques could help to run better local AI by streaming tokens from multiple machines and combining them

irthomasthomas today at 5:16 PM

I have a version of this called llm-consortium which I originally vibe-coded from a karpathy tweet[0].

  "I find that recently I end up using all of the models and all the time... for a lot of problems they have this 'NP Complete' nature to them, where coming up with a solution is significantly harder than verifying a candidate solution. So your best performance will come from just asking all the models, and then getting them to come to a consensus."

I realized at some point that 'consortium' was not proper term for what this was doing, since I was creating a kind of llm organization/council, whereas a consortium is a group of organizations. So rather than rename it I added the ability to create a consortium of consortiums, where each member can itself be a consortium models. The arbiter can also be a consortium which enables multi-model judging. This can obviously baloon token usage insanely, I think my record is over 100 models prompted from one prompt.

So to reign in the token explosion somewhat I added a simple rank mode, which produces only a ranking, and then the top ranked answer is returned. You can use this in combination with meta-consortiums like this

  >llm consortium save cns-kimi -m k2.7-code -n 5 --arbiter mercury-2 --judging-method rank
  llm consortium save cns-glm -m glm-5.2 -n 5 --arbiter mercury-2 --judging-method rank
  llm consortium save cns-meta-glm-kimi -m cns-glm -m cns-kimi --max-iterations 1 --arbiter qwen-3.5 # judging-method left at default to create a synthesis

This will first send five prompts each to kimi and glm and pick top ranked answer from each using the fast mercury-2 model, then it will create a synthesis from those two responses using a better model like qwen Mercury-2 is extremely fast, and good for ranking mode, but for synthesis I prefer a slightly larger model. This is most important when you are using it inside a harness or agent with a strict output format. This is because then you end up nesting a complex structure embedded in another complex structure (llm-consortium uses structured reasoning with xml tags). Even opus sometimes struggles with this in the few times I tried it - but qwen, glm and kimi have all been reliable arbiters so far.

If you combine it with the llm-model-gateway plugin you can serve a consortium like a regular model on an openai proxy and the response will be the synthesis, and conversation context is preserved for multi-turn chats.

[0] https://x.com/karpathy/status/1870692546969735361 Further reading: Mixture-of-agents https://www.together.ai/blog/together-moa Google's Mind-Evolution https://arxiv.org/html/2501.09891v1

__mharrison__ today at 1:08 PM

Random forest!

jedisct1 today at 1:41 PM

I got significant improvement on code quality (so much that it has become a no brainer for important tasks such as planning) simply by adding the --self-review flag to swival: https://swival.dev/pages/reviews.html

Two instances of the same model, a producer and a reviewer, and the loops doesn't end until everybody's happy.

deleted today at 11:36 AM

galsapir today at 11:44 AM

really interesting that its basically almost 80% claude opus..

rusk today at 12:53 PM

I have an old, slow GPU setup that has nearly 100gb of VRAM

I had been trying to fill this up with big models but it doesn’t seem like these give a good return per Gb

I’m looking at that and wondering would I be better off running multiple such models in parallel. It would probably be a better way to load balance across SLI.

My guess is the scaling will be more “mythical man month” than “no more free lunch” - the interaction of models resembling social dynamics moreso than multi-core setups.

Given that these actors are largely homogenous in culture and incentivising, and coordination overhead is drastically reduced.

Commonly we consider optimal team size to be between 3 and 7 and Brookes’ maximum team size is around 10 or so before the system fails. It should be possible to blow way past those numbers and still experience increased gains in productivity as long as you can keep all your instances stoked.

aplomb1026 today at 4:27 PM

[flagged]

implexa_founder today at 4:18 PM

[flagged]

insumanth today at 10:21 AM

[dead]

64lamei today at 3:53 PM

[flagged]