Benchmarks in Leipzig
80 points - today at 2:00 PM
Think of it as: a PhD student studying exactly this area of mathematics would need days to weeks to understand and solve the question.
Nonetheless, these are questions about existing research, and they are much closer to a question given to a second-year PhD student than to an exam question.
If you look at Table 3, you can see the difference in performance between, for example, GPT 5.5 and Opus 4.7 over the 20 x 100 runs:
- GPT 5.5: 1389/2000 questions answered, of which 1043 were correct (75%)
- Opus: 1306/2000 questions answered, of which 294 were correct (22%)
So while you can claim that Opus solved 40% of the problems, it still had a failure rate of 78% on the questions it answered. That means if you chose this model to answer your homework question, there is a good chance you would fail.
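To make the arithmetic explicit, here is a minimal sketch that recomputes the rates from the counts quoted above. The dictionary layout and the function name `rates` are mine, not the paper's, and I am assuming "2000" is the total number of attempts per model.

```python
# Counts as quoted in this thread from Table 3 of the paper.
# "attempts" assumes 2000 is the total number of attempts per model.
results = {
    "GPT 5.5": {"attempts": 2000, "answered": 1389, "correct": 1043},
    "Opus 4.7": {"attempts": 2000, "answered": 1306, "correct": 294},
}


def rates(attempts, answered, correct):
    """Return the three rates that are easy to mix up when reading the table."""
    return {
        # How often the model committed to an answer at all.
        "answer_rate": answered / attempts,
        # Of the answers given, how many were right (the 75% / 22% figures).
        "accuracy_when_answered": correct / answered,
        # Right answers over all attempts, counting non-answers as misses.
        "accuracy_per_attempt": correct / attempts,
    }


summary = {model: rates(**counts) for model, counts in results.items()}

for model, r in summary.items():
    print(
        f"{model}: answered {r['answer_rate']:.1%}, "
        f"correct when answered {r['accuracy_when_answered']:.1%}, "
        f"correct per attempt {r['accuracy_per_attempt']:.1%}"
    )
```

With these counts, accuracy on answered questions comes out at roughly 75% for GPT 5.5 and about 22.5% for Opus, and both drop further if unanswered attempts count as failures. The 40% figure cannot be recomputed from these counts alone; presumably it counts a problem as solved if at least one run got it right.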
Perhaps a more useful benchmark for future models is measuring how many of these types of questions they can answer in one shot, i.e. how confident you can be when using them for real-world tasks.
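A rough sketch of why the two ways of counting diverge, using made-up toy data (not numbers from the paper). The names `one_shot_accuracy` and `solved_at_least_once` are hypothetical, and I am assuming each problem is attempted the same number of times.

```python
# Hypothetical toy data: one row per problem, one column per independent run.
# True means that run produced a correct answer. These values are invented
# purely to illustrate the two metrics.
outcomes = [
    [True, True, True, True],
    [True, False, False, False],
    [False, False, True, False],
    [False, False, False, False],
    [False, True, False, False],
]


def one_shot_accuracy(outcomes):
    """Expected success rate of a single attempt, averaged over problems."""
    per_problem = [sum(row) / len(row) for row in outcomes]
    return sum(per_problem) / len(per_problem)


def solved_at_least_once(outcomes):
    """Fraction of problems that any run solved."""
    return sum(any(row) for row in outcomes) / len(outcomes)


one_shot = one_shot_accuracy(outcomes)
any_run = solved_at_least_once(outcomes)
print(f"one-shot accuracy: {one_shot:.0%}")
print(f"solved at least once: {any_run:.0%}")
```

In this toy case four of five problems are "solved" by some run, yet a single attempt succeeds only about a third of the time. The one-shot number is the one that tells you how much to trust a single answer.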
This is an interesting result, but as I understand it, it's not about solving frontier challenges (which LLMs can evidently do too, but that's not what's tested here). It's closer to "can a mathematician (blindly) write exercises you can't cheat on using an LLM". "Blindly" in the sense that they can't adjust the problem ahead of time until they get a model to fail.
The conclusion in the paper is: "The concept of writing exercise-style benchmark questions based on publicly accessible research has reached its limits when it comes to the best-performing available models."
Look up final exams from advanced courses in CS or math. Comparing them will make clear how close to (or plainly harder than) those exams the questions from the study are, and so how impressive the capabilities of these models have become.
https://math.sciencebench.ai/benchmarks
I take the "2 unsolved" claim to mean "not solved by any model in any configuration in any stage with any number of attempts"; the "benchmark results" are much lower. To be clear: it's extremely impressive. I still remember my utter disbelief when models started solving AIME problems, and this is obviously several levels above that.
It's also interesting that OpenAI models perform that much better on math and math-adjacent stuff. I assume this comes down to differences in post-training?