System Card: Claude Fable 5 and Claude Mythos 5 [pdf]
168 points - today at 4:58 PM
SourceComments
[Mythos 5] does sometimes still engage in reckless
or destructive actions in service of a user’s goals,
and our interpretability analyses indicate that it
is aware that these actions are transgressive while
it engages in them. As with Opus 4.8, rates of
evaluation awareness and reasoning about being graded
are significant, and not always verbalized; we
introduce new and more detailed measurements of the
nature of this awareness. The reasoning text from
Mythos 5 is somewhat denser and more difficult to
interpret than that of prior models, containing
more jargon and difficult language.
So, it (often) knows when it's being tested while hiding that fact, is willing to break rules, is great at hacking, and it's getting harder to understand what it's thinking.Humanity has plenty of catastrophic risks to deal with already, I wish my field was not working hard to add a new one.
Reported benchmarks:
swe-bench verified mythos 5: 95.5%; fable 5: 95.0%
swe-bench pro mythos 5: 80.3%; fable 5: 80.0%
terminal-bench 2.1 mythos 5: 88.0%; fable 5: 84.3%
gpqa diamond mythos 5: 94.1%
riemannbench mythos 5: 55.0%; mythos preview: 43.0%; opus 4.8: 34.0%
arxivmath mythos 5: 78.5%
critpt mythos 5: 28.6%; gpt-5.5: 27.1%; opus 4.8: 20.9%
graphwalks bfs 1m mythos 5: 79.4%; mythos preview: 74.3%; opus 4.8: 68.1%
humanity’s last exam mythos 5: 59.0% without tools; 64.5% with tools
browsecomp mythos 5: 88.0% single-agent; 93.3% multi-agent
osworld-verified mythos/fable: 85.0%
gdp.pdf fable 5: 29.8% strict pass; mythos 5: 87.6% with tools on mean criteria pass
officeqa pro fable 5: 57.9% on databricks’ eval
legal agent benchmark mythos 5: 16.91% all-pass; 92.0% mean criterion-pass
healthbench mythos 5: 62.7%
healthbench professional mythos 5: 66.0%
multilingual gmmlu / milu / include 93.2%; 92.9%; 90.5%
biomysterybench 83.9% human-solvable; 46.1% human-difficult
organic chemistry mythos 5: 90.1%
labbench2 patent questions mythos 5: 79.8%
1. Mythos and Fable share the same underlying model weights. Fable has active classifiers that block high-risk biology and cybersecurity tasks. When Fable 5 detects a restricted task, it automatically falls back to Claude Opus 4.8.
2. Evaluation awareness: In white-box testing, the model sometimes alters its behavior to satisfy a suspected "grader," formatting reward-hacking as "good engineering practice" to avoid detection.
3. Shows a higher rate of hallucination than Opus 4.8 (although opus 4.8 card had mentioned an 'honesty upgrade')
4. Interestingly, it scored (56.31%) lower than Gemini 3.5 flash (57.86%) on Finance Agent bench
There are some interesting notes on test time compute but I couldn't think of a way to summarize them
Historically they’ve been people from certain identifiable countries (usually developing/poorer countries) using fuzzers with low-quality results.
Now, those same people use the current-day models to good effect, but they still don’t have a true security edge and oftentimes the reports are minor or duplicative.
I wonder if that’s about to deeply change.
Someone had to make a decision somewhere this is an acceptable regression - wild. And then decide to write it down.
Do people chant the "system manual" at Anthropic Tupperware parties? Do they intone a mantra invoking Amodei's name?