First Proof
75 points - today at 3:25 PM
A surfboard is also an amazing tool, but there's more to operating one than telling it which way to go.
Many people want self-driving cars so they can drink in the back seat while watching movies. They'll find their jobs replaced by AI, with a poor quality of life, because we're a selfish species. In contrast, Niki Lauda trusted fellow Formula 1 driver James Hunt to race centimeters apart. Some people want AI to help them drive that well. They'll have great jobs as AI evolves.
Garry Kasparov pioneered "freestyle" chess tournaments after his defeat by Deep Blue, in which the best human players were paired with computers, coining the "centaur" model of human-machine cooperation. This is frequently cited in the finance literature, where it is recognized that AI-guided human judgement can outperform either humans or machines alone.
Any math professor knows how to help graduate students confidently complete a PhD thesis, or how to humiliate students in an oral exam. It’s a choice. To accomplish more work than one can complete alone, choose the former. This is the arc of human evolution: we develop tools to enhance our abilities. We meld with an abacus or a slide rule, and it makes us smarter. We learn to anticipate computations, like we’re playing a musical instrument in our heads. Or we pull out a calculator that makes us dumber. The role we see for our tools matters.
Programmers who actually write better code using AI know this. HN threads like this one are filled with despair over the poor quality of vibe coding. At the same time, Anthropic is successfully coding Claude using Claude.
It seems likely that PhD students in the authors' subfields are capable of solving these problems. What makes the problems interesting is that they seem to require fairly deep, research-level context to make real progress.
It’s a test of whether LLMs can really synthesize results from knowledge that requires several years of postgraduate preparation in a specific research area.
> the answers are known to the authors of the questions but will remain encrypted for a short time.
Ok. But humans may be able to solve the problems too. What prevents Anthropic or OpenAI from hiring mathematicians, having them write the proof, and passing it off as LLM-written? I'm not saying that's what they'll do. But shouldn't the paper say something about how they're going to validate that this doesn't happen?
Honest question here, not trying to start a flame war. I'm honestly confused about how this is going to test what it wants to test. Or maybe I'm just plain confused. Can someone help me understand this?
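For what it's worth, the "encrypted for a short time" part reads like a standard commit-reveal scheme: publish a hash of each answer before models attempt the problems, reveal the answer and nonce afterwards, and anyone can verify that the answers predate the model runs. A minimal sketch in Python (the function names and commitment format here are my own illustration, not anything the paper specifies):

    import hashlib
    import os

    def commit(answer: str) -> tuple[str, str]:
        # A random nonce prevents brute-forcing the digest of short answers.
        nonce = os.urandom(16).hex()
        digest = hashlib.sha256(f"{nonce}:{answer}".encode()).hexdigest()
        return digest, nonce  # publish the digest now; keep the nonce secret

    def verify(digest: str, nonce: str, answer: str) -> bool:
        # Anyone can later check a revealed answer against the published digest.
        return hashlib.sha256(f"{nonce}:{answer}".encode()).hexdigest() == digest

    # Authors publish `digest` before the test window opens,
    # then reveal (nonce, answer) once it closes.
    digest, nonce = commit("the answer to problem 7")
    assert verify(digest, nonce, "the answer to problem 7")

That pins down what the answers were, but it doesn't address the other concern: no commitment scheme proves a submitted proof was machine-written rather than ghost-written by a hired mathematician.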
Is this known?
The field of AI4Math has so many well-executed benchmarks -- based on the related-work section, it seems the authors aren't familiar with AI4Math at all.
My belief is that this paper is being discussed at all solely because a Fields Medalist, Martin Hairer, is on it.
As it should. Good.
This is a totally independent test, neither conducted by nor done in collaboration with any of the AI companies or their employees, so no bias is introduced[0].
[0] Unless the researchers have undisclosed ownership of shares in private AI companies.