Benchmarks in Leipzig

80 points - today at 2:00 PM

Comments

christianstump today at 2:51 PM

I am the leader of the study and the author of the benchmark paper. Let me add: the problems are much harder than any exam question.

Think of it as: a PhD student studying exactly this area of mathematics would need days to weeks to understand and solve the question.

Nonetheless, these are questions about existing research, though much closer to a question given to a second-year PhD student than to an exam question.

spuz today at 3:17 PM

As well as measuring how many questions each model was able to answer correctly, I think it's equally important to measure how many questions each model answered incorrectly. After all, if you consider using them as a tool, you will need to have confidence that any answer they give is correct.

If you look at Table 3 you can see the difference in performance between, for example, GPT 5.5 and Opus 4.7 across the 20 x 100 runs:

- GPT 5.5: 1389/2000 questions answered, of which 1043 were correct (75%)

- Opus: 1306/2000 questions answered, of which 294 were correct (22%)

So while you can claim that Opus solved 40% of the problems, it still had a failure rate of 78% on the questions it answered. That means if you chose this model to answer your homework question, there is a good chance you would fail.
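The arithmetic behind those percentages can be sketched in a few lines of Python. This is only an illustration: the counts are the ones quoted above from Table 3, while the function name `answer_stats` and the field names are made up for this example and do not come from the paper.

```python
def answer_stats(answered: int, correct: int, total: int) -> dict:
    """Summarize one model's runs (hypothetical helper, not from the paper).

    coverage   = share of all questions the model attempted to answer
    precision  = share of answered questions that were correct
    error_rate = share of answered questions that were wrong
    """
    if total <= 0 or not (0 <= correct <= answered <= total):
        raise ValueError("expected 0 <= correct <= answered <= total, total > 0")
    precision = correct / answered if answered else 0.0
    return {
        "coverage": answered / total,
        "precision": precision,
        "error_rate": (1.0 - precision) if answered else 0.0,
    }


# Counts as quoted in the comment above (20 runs of 100 questions each).
quoted_counts = {
    "GPT 5.5": (1389, 1043),
    "Opus 4.7": (1306, 294),
}

for model, (answered, correct) in quoted_counts.items():
    stats = answer_stats(answered, correct, total=2000)
    print(
        f"{model}: coverage {stats['coverage']:.1%}, "
        f"precision {stats['precision']:.1%}, "
        f"error rate {stats['error_rate']:.1%}"
    )
```

With the quoted counts this gives a precision of about 75.1% for GPT 5.5 and about 22.5% for Opus 4.7, in line with the rounded figures above. Keeping coverage and precision separate makes the point explicit: two models can attempt a similar number of questions and still differ widely in how often an answer can be trusted.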

Perhaps a more useful benchmark for future models is measuring how many of these types of questions they can answer in one shot. I.e. how confident can you be when using them for real world tasks.

zerobees today at 2:21 PM

I know that people with strong feelings one way or the other will comment here, but note that this is specifically about problems with known answers that can be inferred from existing literature (e.g., training data).

This is an interesting result, but as I understand it, it's not about solving frontier challenges (which LLMs can evidently do too, but that's not what's tested here). It's closer to "can a mathematician (blindly) write exercises you can't cheat on using an LLM". "Blindly" in the sense that they can't adjust the problem ahead of time until they get a model to fail.

The conclusion in the paper is: "The concept of writing exercise-style benchmark questions based on publicly accessible research has reached its limits when it comes to the best-performing available models."

tomtomatoide today at 4:55 PM

For some reason, perhaps some sort of Freudian self-defense mechanism, we tend to downplay how impressive it is to solve never-before-seen problems that require a deep understanding of the concepts at play.

Look up final exams from advanced courses in CS or math. It will clarify how close (or plainly, how much harder) the questions from the study are, and so how impressive the capabilities of these models are becoming...

puttycat today at 3:13 PM

Hopefully they password-protect the datasets:

https://arxiv.org/abs/2305.10160

qsort today at 2:26 PM

These are the results from the website they link in the paper:

https://math.sciencebench.ai/benchmarks

I take the "2 unsolved" claim to mean "not solved by any model in any configuration in any stage with any number of attempts"; the "benchmark results" are much lower. To be clear: it's extremely impressive. I still remember being in utter disbelief when models started solving AIME problems, and this is obviously several levels above that.

It's also interesting that OpenAI models perform that much better on math and math-adjacent stuff. I assume this comes down to differences in post-training?

root-parent today at 2:02 PM

"...Between April 1 and May 15, 2026, a group of 49 mathematicians compiled a dataset of research-level mathematics questions with known answers... We present the resulting collection of 100 questions....We evaluated these questions in three stages: a single attempt by five state-of-the-art LLMs....we concluded Stage 3 with only 2 unsolved questions. This demonstrates that the mathematical reasoning capabilities of LLMs are becoming impressive..."

openclawclub today at 2:26 PM

[flagged]

Towaway69 today at 2:24 PM

As long as it's not conscious, we're safe.