Will It Mythos?

277 points - today at 4:15 AM

Comments

Tossrock today at 5:48 AM

As I posted in another comment, I found Fable to be substantially more powerful than any previous model. However, this isn't just an ungrounded opinion - I uploaded my full session transcript and code created working on a very complex implementation, so people can judge for themselves, if they're interested: https://tossrock.substack.com/p/36-hours-with-fable

JumpCrisscross today at 10:23 AM

> Note GPT 5.5 Pro is at the top of the leaderboard only because it blew through $100 budget after only completing four cases, so 2/4 is 50%. And, a couple of other results, both Qwen models, are skewed upward in the detect % ranking because of failure to complete all cases.

Try ranking by the lower bound of a Wilson score interval, which is one form of binomial proportion confidence interval [1].

So GPT 5.5 Pro’s 2/4 (p = 0.5) for one-sided 95% (z ~ 1.645), adjusts to 0.182 [a], and the top models are revealed as the 4/9s (mimo-v2.5-pro, gpt-5.5, opus-4.8, gemini-3.5-flash and deepseek-v4). (We need to dial CI down to 76% for GPT 5.5 Pro to regain top status.) If we account for speed in that cohort, deepseek-v4 (91s) is fastest followed by opus-4.8 (137s).

Given deepseek-v4 is also the cheapest model among those five, I would say—based on these data—it’s the winner. (Out of the table. If Fable got 9/9, it’s obviously first.)

[1] https://en.wikipedia.org/wiki/Binomial_proportion_confidence...
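
For anyone who wants to check the arithmetic, here is a minimal Python sketch of the one-sided Wilson lower bound. The function name `wilson_lower` is my own choice, and the only inputs are the 2/4 and 4/9 counts quoted above; treat it as an illustration rather than the benchmark author's method.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower(successes: int, trials: int, confidence: float = 0.95) -> float:
    """One-sided lower bound of the Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    z = NormalDist().inv_cdf(confidence)  # about 1.645 for one-sided 95%
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half_width = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half_width


print(round(wilson_lower(2, 4), 3))  # 2 of 4 cases detected
print(round(wilson_lower(4, 9), 3))  # 4 of 9 cases detected
```

This gives about 0.182 for 2/4 and about 0.218 for 4/9, so the 4/9 models rank above the 2/4 result once the small sample is penalized.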

po1nt today at 5:47 AM

From all the things I've read, I'm pretty convinced that Mythos is just a standard LLM with safety features turned off. If current models weren't reluctant to search for vulnerabilities, they might perform as well as Mythos.

jrochkind1 today at 4:37 AM

> And, all of the bugs can be identified by several models if they are pointed directly at it and told what to look for.

This made me think, well, sure, if you tell them what to look for... but then:

> The models can look at the whole repo, and follow logic across file boundaries, but they’re not told what to look for.

So okay, the first one was an accidental mis-statement?

utopcell today at 4:37 PM

The "best" model finds 4/9 bugs. It would be interesting to see if all models find the _same_ bugs. Does a collection of models exist that can cover all 9?

Also, it seems to me that pointing a model to a bug and asking it to solve it is somewhat easier than what Mythos did, which if I understand correctly, was to generally look at a codebase and find any bug. Even so, non-Mythos models only managed to fix 4/9 of these bugs.

I think the article makes the point that Mythos is at a different level.
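
The coverage question above (whether some collection of models finds all 9) could be checked with a greedy pass over per-model results. A rough Python sketch follows. The model names and detected-bug sets are made-up placeholders, not results from the article, and `greedy_cover` is a hypothetical helper.

```python
def greedy_cover(
    detections: dict[str, set[int]], universe: set[int]
) -> tuple[list[str], set[int]]:
    """Greedily pick models until no remaining model adds new coverage.

    Returns the chosen models and the bugs that no model detected.
    """
    uncovered = set(universe)
    chosen: list[str] = []
    while uncovered and detections:
        # Pick the model that detects the most still-uncovered bugs.
        best = max(detections, key=lambda m: len(detections[m] & uncovered))
        gained = detections[best] & uncovered
        if not gained:
            break  # nobody finds any of the remaining bugs
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# PLACEHOLDER data for illustration only; not taken from the benchmark.
detections = {
    "model_a": {1, 2, 3, 4},
    "model_b": {3, 4, 5, 6},
    "model_c": {1, 7},
}
chosen, missed = greedy_cover(detections, set(range(1, 10)))
print(chosen, missed)
```

Greedy selection does not guarantee the smallest covering set, but it does answer the existence question: if `missed` comes back empty, the models together cover every bug.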

airstrike today at 6:33 AM

Around February, Opus 4.6 was excellent. Smart, fast, proactive. Then it got lobotomized and it's never been the same after that nerf. 4.7 came along and it too was disappointing—not unlike 4.8, which despite feeling a smidge smarter, tends to write word salad and is basically unusable for some workflows.

Fable felt like having access to that "old Opus" again, but a little smarter. Sort of like I'd expect an Opus 5 to be. It's not earth shattering, but it was a step in the right direction. And it was distinctively so, because having to go back to Opus 4.6/4.7/4.8 has been borderline depressing...

It understood more with less help, did more per turn, and was less argumentative. It also felt a little less trite in its answers, which is an understated improvement for those who use Claude Code all the time.

p0w3n3d today at 9:57 AM

I've read opinions that this is speculation meant to raise Anthropic's value. They are known for saying "horrific things" and for personifying the AI they deliver. It sometimes even sounds unprofessional.

This line of communication might even have influenced the courts in the copyright violation case ("it is not copyright violation if a person learned something and it knows it and thinks of it"). However, an algorithm does not think. If I took your book, lossily encrypted it, and then decrypted it while filling in the broken words, would I be violating your copyright or not?

rubymamis today at 8:46 AM

Fable was the only model that was able to detect a data corruption bug in my Qt C++ note-taking app[1] that all other tested models (gpt-5.5 xhigh, GLM-5.1, Kimi 2.7, DeepSeek V4 Pro) didn't find. I'll test on GLM-5.2 and Mimo v2.5 Pro soon.

[1] https://www.get-notes.com

rirze today at 6:08 PM

I'm convinced if Mythos/Fable comes back at this point, it will be guardrailed into lobotomy.

It won't be as good.

jaggederest today at 4:57 AM

In my brief experience, the difference between fable and opus is largely in persistence, not global intelligence like you might expect. Fable just... goes the extra mile, sometimes in a scary way.

snthpy today at 12:27 PM

I miss Fable. Will it ever be back? As a non-US citizen living in Africa, I fear that I will have to wait for an equivalent non-US model.

wrs today at 4:18 PM

IIRC from the Anthropic report, the alleged danger of Mythos isn’t that it finds more vulnerabilities than previous models, but that it’s significantly more successful at exploiting them. Which this doesn’t seem to test.

qaq today at 6:25 AM

Fable was able to one-shot pretty big features. In a write spec -> refine spec -> create todos -> implement todos workflow, the difference vs codex or opus was far less pronounced.

draginol today at 2:12 PM

I was pretty impressed with Fable when I used it. Fable on Low was better than Opus 4.8 on High (and cheaper).

Now, for me, it was really about how well it worked on big existing human-made code bases. I was working on some new screens in GalCiv IV, and if you've ever had to make screens for games, you know it is incredibly tedious, low-brain work. But GPT 5.5 and Opus 4.8 would just struggle with these over and over again, and this is C++ work with limited hotloading, so it's a slow process. Fable nailed these screens fast.

stared today at 7:55 AM

For malware detection, many models are biased for or against detecting a threat (likely a thing that can be adjusted with a prompt).

I suggest using tasks whose answers cannot be guessed (find, not tell), and 2D charts for both ROC and pricing; see https://quesma.com/benchmarks/binaryaudit/

GeorgeWoff25 today at 5:46 AM

Spatial reasoning is where fable really separates itself imo

_alternator_ today at 2:29 PM

This is cool, but note that it doesn't address one of the main (claimed) advantages of Mythos: lower false positive rates. That is, give it files without serious bugs and it will not raise alarms.

himata4113 today at 7:18 AM

What makes Mythos special is the fact that someone with zero expertise in the field could find and weaponize a zero-day. Real threat actors already use LLMs en masse, and the recent advancements with glm-5.2 will probably enable way more cyber attacks than Fable ever could.

rbbydotdev today at 1:11 PM

I find it ironic: we now have to use lesser models that write potentially MORE buggy code, rather than greater models that would let us write LESS buggy code. It's paradoxical.

irthomasthomas today at 10:03 AM

I find this interesting:

  …no model performed better with an Agent, a couple performed worse, and time/tokens/costs were consistently much higher with the agent in the loop, for some reason.

Someone should build a harness where features are only added if they are proven net positive to outcomes.

StizzurpXDD today at 6:06 AM

This just shows that Google needs to double down on its AI models fast. Even open source Chinese models are beating 3.1 Pro and 3.5 Flash in almost everything.

wiz21c today at 11:53 AM

As a European, it's funny to read these stories about Fable without being able to check for myself. It feels like being a kid watching other kids play with nicer toys.

jonplackett today at 8:07 AM

I thought the whole point was that it doesn’t need to be pointed at the problem. That’s a much easier problem to solve. Also you eliminate 10000 false positives.

ryangg today at 9:50 AM

The leaderboard sorting is very misleading: gpt-5.5-pro only found 2, while mimo-v2.5-pro found 4.5 out of 9 cases.

FartyMcFarter today at 7:22 AM

Is the title a reference to "will it blend"?

wald3n today at 5:52 AM

The benchmark fills an interesting niche, but the methods need work considering how many caveats are included in the results.

GL26 today at 7:25 AM

Frankly after testing out Fable last week, it was just a bigger sink of tokens than anything else. The amount of tokens consumed by it wasn't worth the steps it saved me compared to using opus 4.8.

mixmastamyk today at 5:34 AM

Could someone point the thing at Ventoy please?

catigula today at 12:59 PM

What year are we in?

>I am skeptical of the reasons given publicly, I suspect it’s really just so much more expensive to operate than their current models that they don’t want to offer it broadly, yet, given the difficulty they’ve had growing capacity to keep up with use. But, are they telling the truth about how good it is at finding security vulnerabilities or is it just more hype?

Meanwhile,

1. Mythos is banned by the government per reality.

2. The NSA said it hacked all of their systems in hours per multiple sources.

3. The Five Eyes spy agencies said we're about to have an AI global catastrophe in a few months per the Guardian.

mcoliver today at 6:08 AM

Gemini / antigravity didn't use to be this hamstrung. Something changed within the past couple of months that makes security work very difficult. Even auditing or securing your own code now requires an insane amount of prompt engineering, which is utterly ridiculous and did not use to be required.

holoduke today at 6:12 AM

Yesterday I wanted to delete records from a database on my own SSH server. It refused to do so, no matter what I prompted. Very annoying.

reinitctxoffset today at 4:45 AM

Opus 4 class models are terrifying at infosec. They tie their shoelaces together on other things, but don't fuck with them on that. It's a savant thing.

A cursory reading of the model card shows Mythos/Fable is a fine tune on Project Zero with some steering on persistence.

But I think it's a valuable lesson: advertise your product as a nuclear weapon while microdosing at Lighthaven to enough Davos attendees and sooner or later? Someone is going to evaluate the claim from a chair where you act first and nuance later.

Wild that Amodei's blog and pod circuit are the greatest IPO risk.

fsadsadsdasdas today at 5:47 AM

Truth is stranger than fiction.

bottlepalm today at 5:44 AM

Surprise.. someone downplaying Mythos/Fable who didn't actually use it. Plenty of comments here say the contrary, including my own: in my personal experience, Fable was easily a step change in capability over Opus, figuring out things in reverse engineering binaries that Opus plain couldn't find.

davedx today at 7:00 AM

I don't understand the article.

"I’d say this benchmark answers with a resounding, “Maybe.”

Mythos maybe really is better than the other current models at finding security bugs"

Yet in the results, I don't see Mythos?

It seems like a really well researched article with lots of results for other models, yet the title seems to be clickbait because the results don't contain Mythos, do they?