Toward automated verification of unreviewed AI-generated code
50 points - yesterday at 10:52 AM
SourceComments
For those of us with decades of experience and who use coding agents for hours per-day, we learned that even with extended context engineering these models are not magically covering the testing space more than 50%.
If you asked your coding agent to develop a memory allocator, it would not also 'automatically verify' the memory allocator against all failure modes. It is your responsibility as an engineer to have long-term learning and regular contact with the world to inform the testing approach.
No side effects is a hefty constraint.
Systems tend to have multiple processes all using side effects. There are global properties of the system that need specification and tests are hard to write for these situations. Especially when they are temporal properties that you care about (eg: if we enter the A state then eventually we must enter the B state).
When such guarantees involve multiple processes, even property tests aren’t going to cover you sufficiently.
Worse, when it falls over at 3am and you’ve never read the code… is the plan to vibe code a big fix right there? Will you also remember to modify the specifications first?
Good on the author for trying. Correctness is hard.
Again, I'm not opposed to AI coding. I know a lot of people are. I have multiple open source projects that were 100% created with AI assistants, and wrote a blog post about it you can see in my post history. I'm not anti-ai, but I do think that developers have some responsibility for the code they create with those tools.
Who writes the tests? It can be ok to trust code that passes tests if you can trust the tests.
There are, however, other problems. I frequently see agents write code that's functionally correct but that they won't be able to evolve for long. That's also what happened with Anthropic's failed attempt to have agents write a C compiler (not a trivial task, but far from an exceptionally difficult one). They had thousands of good human-written tests, but the agents couldn't get the software to converge. They fixed one bug only to create another.
The trickiest part in my experience isn't catching obviously wrong code - it's catching code that works for all the common cases but fails on edge cases the AI never considered. Traditional code review catches these through reviewer experience, but automated verification needs to somehow enumerate edge cases it hasn't seen.
Property-based testing (Hypothesis for Python, fast-check for JS) is probably the closest existing tool for this. Instead of writing specific test cases, you describe properties the code should satisfy, and the framework generates thousands of random inputs. It's remarkably good at finding the kind of boundary bugs AI tends to produce.
I would like to put my marker out here as vigorously disagreeing with this. I will quote my post [1] again, which given that this is the third time I've referred to a footnote via link rather suggests this should be lifted out of the footnote:
"It has been lost in AI money-grabbing frenzy but a few years ago we were talking a lot about AIs being “legible”, that they could explain their actions in human-comprehensible terms. “Running code we can examine” is the highest grade of legibility any AI system has produced to date. We should not give that away.
"We will, of course. The Number Must Go Up. We aren’t very good at this sort of thinking.
"But we shouldn’t."
Do not let go of human-readable code. Ask me 20 years ago whether we'd get "unreadable code generation" or "readable code generation" out of AIs and I would have guessed they'd generate completely opaque and unreadable code. Good news! I would have been completely wrong! They in fact produce perfectly readable code. It may be perfectly readable "slop" sometimes, but the slop-ness is a separate issue. Even the slop is still perfectly readable. Don't let go of it.
[1]: https://jerf.org/iri/post/2026/what_value_code_in_ai_era/
I’m actually starting to get annoyed about how much material is getting spread around about software analysis / formal methods by folks ignorant about the basics of the field.
Just write your business requirements in a clear, unambiguous and exhaustive manner using a formal specification language.
Bam, no coding required.
Each of these approaches is just fine and widely used, and none of them can be called "automated verification", which, if my understanding of the term is correct, is more about mathematical proof that the program works as expected.
The article mostly talks about automatic test generation.