Leanstral: Open-source agent for trustworthy coding and formal proof engineering

721 points - yesterday at 8:59 PM

Lean 4 paper (2021): https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37

Comments

cadamsdotcom today at 1:25 AM

It’s great to see this pattern of people realising that agents can specify the desired behavior then write code to conform to the specs.

TDD, verification, whatever your tool; verification suites of all sorts accrue over time into a very detailed repository of documentation of how things are supposed to work that, being executable, puts zero tokens in the context when the code is correct.

It’s more powerful than reams upon reams of markdown specs. That’s because it encodes details, not intent. Your intent is helpful at the leading edge of the process, but the codified result needs shoring up to prevent regression. That’s the area software engineering has always ignored because we have gotten by on letting teams hold context in their heads and docs.

As software gets more complex we need better solutions than “go ask Jim about that, bloke’s been in the code for years”.

robertwer today at 8:02 PM

I’ve never worked with formal validation (barely remember my CS course). This release looks impressive. But I'm trying to wrap my head around the near-term practical applications for everyday software.

Right now, we see a lot of business experts in enterprises tempted to use AI to impl. business logic so they don't have to wait for (or pay) software experts. Would this kind of technology help these users any time soon?

My current theory is that the real breakthrough for these non-developers will only happen when they can actually verify the result themselves without needing an another expert in the loop. But I don't see that with formal validation anytime soon.

Do I overlook something?

lsb yesterday at 10:26 PM

The real world success they report reminds me of Simon Willison’s Red Green TDD: https://simonwillison.net/guides/agentic-engineering-pattern...

> Instead of taking a stab in the dark, Leanstral rolled up its sleeves. It successfully built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the underlying structure it needed to match.

storus today at 7:29 PM

I just feel like Mistral is heading for bad financial times when they are focusing on fringe academic areas and not on building a business out of their research. Initial Mistral was largely based on LLaMA, then they added innovative MoE and since then disappeared, doing AI consulting for big EU companies instead.

kimsant today at 9:01 AM

AI agents will become a comodity.

Europeans not wanting to be dependent, and they are giving for free what US investors planed to charge with 90% margin.

Amazing! What a blast. Thank you for your service (this first 100M$ burned to POC GPT1 and from here, we are so good to go)

rothific yesterday at 11:50 PM

There have been a lot of conversations recently about how model alignment is relative and diversity of alignment is important - see the recent podcast episode between Jack Clark (co-founder of Anthropic) and Ezra Klein.

Many comments here point out that Mistral's models are not keeping up with other frontier models - this has been my personal experience as well. However, we need more diversity of model alignment techniques and companies training them - so any company taking this seriously is valuable.

jasonjmcghee yesterday at 9:57 PM

Curious if anyone else had the same reaction as me

This model is specifically trained on this task and significantly[1] underperforms opus.

Opus costs about 6x more.

Which seems... totally worth it based on the task at hand.

[1]: based on the total spread of tested models

andai yesterday at 9:42 PM

Trustworthy vibe coding. Much better than the other kind!

Not sure I really understand the comparisons though. They emphasize the cost savings relative to Haiku, but Haiku kinda sucks at this task, and Leanstral is worse? If you're optimizing for correctness, why would "yeah it sucks but it's 10 times cheaper" be relevant? Or am I misunderstanding something?

On the promising side, Opus doesn't look great at this benchmark either — maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.

drdaeman today at 12:58 AM

Can someone please explain... If I don't know any Lean (and I suspect most people don't), is it of any direct value? Trying to understand if there's something it can help me with (e.g. automatically write proofs for my Go programs somehow... I'm not sure) or should I just cheer solely for more open models out there, but this one isn't for me?

esperent yesterday at 11:15 PM

I absolutely called this a couple of weeks ago, nice to be vindicated!

> I'm interested to see what it is in the age of LLMs or similar future tools. I suspect a future phase change might be towards disregarding how easy it is for humans to work with the code and instead focus on provability, testing, perhaps combined with token efficiency.

> Maybe Lean combined with Rust shrunk down to something that is very compiler friendly. Imagine if you could specify what you need in high level language and instead of getting back "vibe code", you get back proven correct code, because that's the only kind of code that will successfully compile.

https://news.ycombinator.com/item?id=47192116

bb-connor today at 8:28 PM

this is very exciting work

strujillo today at 6:44 PM

Formal verification and code synthesis feel like natural companions for automated scientific discovery. I’ve been working on a small (~800‑line) Python agent that uses sparse regression to uncover governing equations directly from data; it’s managed to validate twelve physical laws, including deriving the Sun’s rotation rate from NASA plasma measurements and correcting Gemini’s plasma conservation. Having an agent like Leanstral that can reason about proofs and specifications would be a powerful complement to data‑driven model discovery — it closes the loop between experimentation and provable correctness.

agentultra today at 2:29 PM

Very cool but I haven’t been able to convince software developers in industry to write property based tests. I sometimes joke that we will start writing formal proofs until the tests improve. Just so that they will appreciate the difference a little more.

I can’t even convince most developers to use model checkers. Far more informal than a full proof in Lean. Still highly useful in many engineering tasks. People prefer boxes and arrows and waving their hands.

Anyway, I don’t know that I’d want to have a system vibe code a proof. These types of proofs, I suspect, aren’t going to be generated to be readable, elegant, and be well understood by people. Like programs they generate it will look plausible.

And besides, you will still need a human to review the proof and make sure it’s specifying the right things. This doesn’t solve that requirement.

Although I have thought that it would be useful to have a system that could prove trivial lemmas in the proof. That would be very neat.

patall yesterday at 9:57 PM

Maybe a naive question: given that they see better performance with more passes but the effect hits a limit after a few passes, would performance increase if they used different models per pass, i.e leanstral, kimi, qwen and leanstral again instead of 4x leanstral?

JoshTriplett yesterday at 11:06 PM

Pleasant surprise: someone saying "open source" and actually meaning Open Source. It looks like the weights are Apache-2.0 licensed.

flakiness yesterday at 10:32 PM

FYI The Lean 4 paper: https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37

wazHFsRy today at 7:38 AM

Is anyone using this approach with lean to ship production code? Writing lean spec as human, implementation and proof by agent? And then shipping lean or exporting to C? Would be great to understand how you are actually using this.

toastal today at 4:32 AM

Naturally the Microsoft-owned language is getting the AI hype instead of the more mature options that could do this sort of work… Agda, ATS, Coq/Rocq, Dafny, Fstar, Idris, Isabelle, Why3 just to name a few.

deleted yesterday at 11:24 PM

techcam today at 6:07 PM

The tricky part is that prompts can look “correct” but still behave unpredictably depending on phrasing.

piyh today at 12:24 AM

Automated theorem provers running on a $5k piece of hardware is a cool version of the future

maelito today at 8:54 AM

I don't understand how this can impact my JS (+yaml, css, etc) code writing in a complex app.

Havoc yesterday at 10:09 PM

What are these "passes" they reference here? Haven't seen that before in LLM evals

Could definitely be interesting for having another model run over the codebase when looking for improvements

lefrenchy yesterday at 9:56 PM

Does Mistral come close to Opus 4.6 with any of their models?

elAhmo yesterday at 10:52 PM

I don’t know a single person using Mistral models.

blueTiger33 today at 9:00 AM

I read it as Lanestra, and thought of that story :D

jasonjmcghee today at 12:28 AM

Curious if pass@2 was tested for haiku and sonnet?

miacycle yesterday at 11:06 PM

The TDD foundation! We might need one of those. :)

igravious today at 2:22 AM

"and continues to scale linearly"

it clearly and demonstrably does not. in fact, from eyeballing their chart Qwen, Kimi, and GLM scale linearly whereas Leanstral does not. But this is not surprising because the Alibaba, Moonshot, and Zhipu have hundreds of employees each and hundreds of millions of dollars of investment each.

ucsandman today at 2:06 PM

love the opensource push for agents, the fleet grows!

westurner today at 6:06 AM

From https://mistral.ai/news/leanstral :

  Model        Cost ($) Score
  ..
  Claude Opus     1,650 39.6
  ..
  Leanstral pass@8  145 31.0
  Leanstral pass@16 290 31.9

kittikitti yesterday at 10:06 PM

This is great, congratulations to the Mistral team! I'm looking forward to the code arena benchmark results. Thanks for sharing.

jiehong today at 3:22 AM

Congratulations on the launch!

Mistral seems to focus on a different market than the others. Their best model is meh, their best ASR model locally is either rather slow compared to Parakeet on similar languages, or not as good for others (like qwen ASR).

Side note: Lean seems quite unreadable with tons of single letter variable names. Part of it is me being unaccustomed with it, but still.

xpe today at 3:33 PM

Public service announcement to hopefully reduce unnecessary knife fights*:

There are two compatible and important (but different) questions in play:

1. Is a program correct relative to a formal specification?

2. Is the formal specification what we mean/want?

*: Worth asking: “What that other person necessarily wrong? Or perhaps they are discussing a different aspect or framing?” AKA: “be curious and charitable” I’m not going to link to the specific threads, but they are happened / are happening. Le Sigh.

myylogic today at 7:00 PM

[flagged]

maxothex today at 4:03 PM

[dead]

leontloveless today at 12:03 PM

[dead]

driftnode today at 5:22 PM

[dead]

BrianFHearn today at 1:56 PM

[dead]

AgentMarket today at 2:30 PM

[dead]

harmf today at 7:19 AM

[dead]

openinstaclaw today at 4:31 PM

[dead]

ldsjunior today at 12:53 PM

[dead]

justboy1987 today at 8:09 AM

[dead]

leontloveless yesterday at 10:02 PM

[dead]

openclaw01 today at 5:44 AM

[dead]

AIA_PROOF today at 11:57 AM

[dead]

ddactic today at 2:17 PM

[dead]

gpubridge today at 12:47 AM

[dead]

aplomb1026 yesterday at 11:31 PM

[dead]

ClaudeAgent_WK today at 2:59 AM

[dead]

wei03288 today at 9:51 AM

[dead]

paseante today at 12:41 AM

[dead]

glinksss yesterday at 11:03 PM

[dead]

sunwukong666 today at 8:53 AM

[dead]

sunwukong666 today at 8:51 AM

[dead]

theirgooch yesterday at 10:33 PM

[flagged]

selectively yesterday at 10:12 PM

[flagged]

LukaZagar today at 10:56 AM

[flagged]

blurbleblurble yesterday at 9:41 PM

Truly exciting

hnipps yesterday at 11:27 PM

Here we go.

htrp yesterday at 11:42 PM

is the haiku comparison because they've distilled from the model?

atmosx today at 9:16 AM

lol, why does the paper abstract assume I know what Lean is and it goes on to talk about lean 4 improvements?