Verified Spec-Driven Development (VSDD)

131 points - today at 4:58 PM

Comments

_pdp_ today at 5:52 PM

Everything in this post stems from the assumption that you already know what you're doing, which is probably true for things you've built before. But I hope we can agree that you can't spec out something you have no clue how to build, let alone write the tests before you've even explored the boundaries of the problem space. That's completely unreasonable.

My second point is that this approach is fundamentally wrong for AI-first development. If the cost of writing code is approaching zero, there's no point investing resources to perfect a system in one shot. What matters more is how fast you can explore the edges. You can now spin up five agents to implement five different versions of the thing you're building and simply pick the best one.

In our shop, we have hundreds of agents working on various problems at any given time. Most of the code gets discarded. What we accept for merging are the good parts.

Robdel12 today at 7:15 PM

I’ve gotten the absolute best results from LLMs just acting like the software engineer I’ve aspired to be the past 15 years.

Normal dev things. Scope the ticket properly, break it down. Test well. Write the correct docs.

LLM-specific things are going to be gone next week.

WestN today at 8:18 PM

Short take: replace TDD with BDD, and might add DDD as a spice. Otherwise this is a fairly good article.

Why not TDD? A lot of developers already use LLMs to create tests, and a lot of the training data covers how to do this. That makes tests something the LLM can either figure out by itself or cheat on. Both are equally bad.

A somewhat controversial take is that you should simply avoid writing tests which the LLM can produce by itself, similar to how we in the last week removed the agents.md file.

SirensOfTitan today at 6:59 PM

LLM-assisted development feels a lot like trend-driven development. When dealing with technique and heterogeneous prompts and goals, it’s easy to fall into something of a gambler’s fallacy with respect to a particular technique.

Spec-driven development feels pretty questionable to me. I’m sure it works fine for feature work that is predictable or has been done before, but then I wonder why you’d waste your time with it.

Prior to LLMs, the whole vibe was to iterate rapidly toward a working thing so you can see what works and what doesn’t. Why would we abandon that strategy as an industry when the cost of writing code is ostensibly getting cheaper?

If I’m using LLMs at all, I’m using them to do a breadth search of prior art or ideas, then I’m doing what I might call a prototype onion: successive clean room attempts at a novel problem, accumulating what I learn at each attempt in each successive prompt. I usually then take the prototype and write the final version myself so I’m properly internalizing the idea.

Ultimately a lot of this prompt work feels like procrastination. It is not about understanding where these tools are useful and where they are not, but about trying to have them consume every aspect of the work.

choeger today at 7:41 PM

If I am not mistaken, the verification is problematic here. It's run too late.

A piece of code that satisfies a single test will most likely not adhere to the spec.

Worse, the whole spec can only be correctly implemented in total. You cannot work iteratively by satisfying one constraint after the other. The same holds for the test cases. That means that satisfying the last test or fulfilling the last constraint will take much more work than the first. The number of tests passed is not a good metric for completion of the implementation.
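To make the point about test counts concrete, here is a minimal sketch. Everything in it is hypothetical: `meets_spec` stands in for the whole spec (here, "return the input sorted") and `special_cased_sort` stands in for an implementation that was grown one passing test at a time.

```python
def meets_spec(values: list[int], result: list[int]) -> bool:
    """The spec as a whole: the result is the input in ascending order."""
    return result == sorted(values)


def special_cased_sort(values: list[int]) -> list[int]:
    """Hypothetical implementation grown one passing test at a time."""
    known_answers = {(): [], (1,): [1], (3, 1, 2): [1, 2, 3]}
    # Any input not seen in a test is returned unchanged.
    return known_answers.get(tuple(values), list(values))


test_inputs = [[], [1], [3, 1, 2]]
tests_passed = sum(meets_spec(t, special_cased_sort(t)) for t in test_inputs)
print(f"{tests_passed}/{len(test_inputs)} tests pass")

# An input outside the test suite shows the spec was never implemented.
print(meets_spec([2, 1], special_cased_sort([2, 1])))
```

Every test passes, yet the check on `[2, 1]` fails, so the pass count says little about how much of the spec is actually done.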

melvinroest today at 7:52 PM

I've been doing something less formal. I stumbled upon Riaan Zoetmulder's free course on deep learning and medical image analysis [1] and found his article on spec-driven development [2]. He adapts the V-Model by specifying three things upfront: requirements, system design and architecture. The rest gets generated. He mentioned a study where they show that LLM assistance slowed down experienced open source devs on large codebases. The model doesn't know the implicit context. And to me that's the thing! An LLM should have an index of some sort.

So I vibe coded my own static analysis program where I just track my own function calls. It outputs a call graph of all my self-defined functions and shows the name (and Python type hints) of what each one is calling (standard library functions are excluded; only self-defined code is included). Running that program and sending the diff from time to time seems to have helped a lot already.

[1] https://www.riaanzoetmulder.com/courses/deep-learning-medica...

[2] https://www.riaanzoetmulder.com/articles/ai-assisted-program...
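The tool itself isn't shown in the comment, so the following is only a rough sketch of what such a call-graph index could look like, using Python's standard `ast` module (Python 3.9+ for `ast.unparse`). The function name `build_call_graph` and the sample source are made up for illustration. It only follows plain-name calls such as `foo(...)`, not method calls, and assumes function names are unique.

```python
import ast


def build_call_graph(source: str) -> dict[str, list[str]]:
    """Map each function defined in `source` to the self-defined functions it calls."""
    tree = ast.parse(source)

    # Collect a signature string (name plus type hints) for every function
    # defined in this source, so builtin and library calls can be filtered out.
    signatures: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = ", ".join(
                f"{a.arg}: {ast.unparse(a.annotation)}" if a.annotation else a.arg
                for a in node.args.args
            )
            returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
            signatures[node.name] = f"{node.name}({params}){returns}"

    graph: dict[str, list[str]] = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            callees = set()
            for child in ast.walk(node):
                # Keep only direct calls to names that are defined in this source.
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    if child.func.id in signatures:
                        callees.add(signatures[child.func.id])
            graph[signatures[node.name]] = sorted(callees)
    return graph


EXAMPLE = '''
def load(path: str) -> list[int]:
    return [int(x) for x in open(path)]

def total(values: list[int]) -> int:
    return sum(values)

def main(path: str) -> int:
    return total(load(path))
'''

for caller, callees in build_call_graph(EXAMPLE).items():
    print(caller, "calls", callees)
```

Saving this output to a file and diffing it between runs gives the kind of compact index that can be pasted into a prompt.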

jFriedensreich today at 10:32 PM

It's an interesting direction if you see it under the umbrella of diminishing costs: you build a product once with vibe coding and a design/product hat. Once you know what works, you rebuild it 100% in a framework like this. You do this from scratch every time the tech debt or the mismatch between architecture and needs gets too big.

DaylitMagic today at 5:55 PM

Some random (hopefully additive and helpful) thoughts:

Many companies have older code bases / databases that can be somewhat well defined (and somewhat not). If things have been slowly iterating over 35 years, there's a lot of undocumented edge behavior that may occur; it may be beneficial to have a step before Edge Case Catalog where there's some kind of prompting to catalogue how the inputs and outputs work, and then find the different inputs and outputs - and then confirm that with Input A and Output A that it works as expected. (Legacy systems often have weird orchestration that nobody remembers.)

(Sub-note: This is somewhat part of the provable properties catalog; while this step could be placed there, it would require a re-run of edge case catalog build potentially, which isn't a bad thing.)
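One way to picture the "catalogue the inputs and outputs first" step is a characterization test: record what the legacy code does today, then diff a rewrite against that record. The sketch below is an assumption-laden illustration; `record_behavior`, `diff_behavior`, `legacy_unit_price` and `rewritten_unit_price` are hypothetical names, not part of the process described in the post.

```python
import json
from typing import Any, Callable


def record_behavior(func: Callable[..., Any], inputs: list[tuple]) -> dict[str, str]:
    """Run `func` on each input and record what it does today, errors included."""
    catalog = {}
    for args in inputs:
        try:
            outcome = {"result": func(*args)}
        except Exception as exc:  # an exception is also observable behavior
            outcome = {"error": type(exc).__name__}
        catalog[json.dumps(args)] = json.dumps(outcome, sort_keys=True)
    return catalog


def diff_behavior(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return the inputs whose recorded behavior changed."""
    return sorted(key for key in old if new.get(key) != old[key])


def legacy_unit_price(total_cents: int, items: int) -> int:
    # Hypothetical legacy code: floors negative values and raises on zero items.
    return total_cents // items


def rewritten_unit_price(total_cents: int, items: int) -> int:
    # Hypothetical rewrite that quietly changes both edge cases.
    if items == 0:
        return 0
    return int(total_cents / items)


inputs = [(100, 4), (100, 0), (-7, 2)]
baseline = record_behavior(legacy_unit_price, inputs)
changed = diff_behavior(baseline, record_behavior(rewritten_unit_price, inputs))
print(changed)
```

Here the diff flags the zero-item case and the negative total, which are exactly the kind of undocumented edge behavior nobody remembers until it changes.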

A small note: I personally think better code commenting than has been outlined here is a good idea. The spec itself should be woven into the code, potentially with slight over-commenting for each aspect; otherwise the spec gets lost. The code itself should serve as context, especially in the TDD stage.

I think it's implicit, but it may be worth stating overtly that the Code Quality check in Phase 3 should also check on a zero-trust basis and confirm the code doesn't include things like hardcoded keys.

I'm not sure what Chainlink is (sorry!) but I like the ideas outlined around the decomposition - but it misses stringing everything together end-to-end in the way outlined here (it asks to create each part, but never actually weaves the whole together).

Something not covered is the sequencing and decomposition of work. A spec can create multiple dependencies within itself, requiring things to be worked on in a specific order.
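As a sketch of what that sequencing step could look like, dependencies between spec items can be written down as a graph and ordered with the standard library's `graphlib` (Python 3.9+). The item names and the `work_order` helper are hypothetical; the post does not define such a step.

```python
from graphlib import CycleError, TopologicalSorter


def work_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which the spec items can be worked on."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"spec items depend on each other: {exc.args[1]}") from exc


# Hypothetical spec items, each mapped to the items it depends on.
spec_items = {
    "data model": set(),
    "storage layer": {"data model"},
    "public API": {"storage layer"},
    "CLI": {"public API"},
}
print(work_order(spec_items))
```

A cycle in the graph raises an error instead of producing an order, which is a useful early signal that the spec contradicts itself about what must come first.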

pron today at 7:01 PM

If you come up with a strategy that seems to "solve programming", then you know for certain there must be a flaw in it, and you need to identify where it is that corners must be cut and how.

Computer science is an introspective discipline because it studies the essential difficulty of problems regardless of the process taken to solve them, and programming itself (i.e. the problem of producing a correct, or correct-enough program) is such a problem that can be, and has been studied. The question of learning whether a program X satisfies some correctness property P is known as the model-checking problem, and we know that answering it with certainty is intractable. For example, some properties that are true for some program would take no less than 10 minutes to verify (regardless of how that verification is done), others will take no less than 10 hours, others no less than 10 months, others no less than 10 years and so on, and we don’t know ahead of time whether the property is true, and if it is, where on this spectrum it falls.

So suppose you decide some property must be proven with full certainty. The question becomes: how long do you wait before giving up on the validation, and what do you do when you give up? If you then decide that you’re okay with less than 100% confidence, what approach do you take and how much confidence do you actually have? The problem is that the answer to that question often requires a deep understanding of the implementation. I.e. if you have two programs, X and Y, that compute the same function, one less-than-perfect approach would give you 99% confidence with one of them, but only 10% confidence with the other.

daveac today at 8:28 PM

In a perfect world I can see this happening. But with AI increasing our output I think the real bottleneck is work sponsors, business logic and requirements discovery/translating/sign off.

I am seeing more teams and features being rolled out faster than before, but then discovering that the sponsors (those requesting features and changes) don’t invest the time either up front or in timely feedback loops. Work then stalls or has to be redone, because the business does not see the results until it’s either live or about to go live.

This has always been the case but I think AI tooling has moved the bottleneck

Animats today at 10:21 PM

Is the posting a description of a real system, or just imagination? Is there a link to something that makes this real?

teiferer today at 9:24 PM

I expected formal verification to be part of this. That could not be fooled and is rock-solid, unless you cheat in your specification. Swap your AI "verifier" out for that and I'm on board.

johnnyAghands today at 10:50 PM

Does anyone know what Chainlink is?

FrankRay78 today at 10:12 PM

Not much different from what I did manually when my employer outsourced development to India.

alpaylan today at 8:08 PM

You cannot escape from the human verifying the properties you want verified mechanically. This only gives you leverage in specific scenarios where specification is much simpler than the implementation.

vielite1310 today at 6:52 PM

I would like to be enlightened myself on whether RPI, BMAD, or any spec-driven approaches have actually worked for any mid/large-scale projects, without wasting millions of tokens of course :)

mpalmer today at 9:30 PM

Looks moderately interesting, but I refuse to upvote vibe-written submissions. Of course documents like this have their uses, but:

- They cannot easily be attributed to a human author, and therefore debates and discussions on the substance of the ideas tend not to get too far

- They (tend to) take well-established concepts, glue them together, and describe the result in reverential tones, regardless of the relative triviality of the solution. Hard to rule that out here tbh.

I am not saying any of that is what's happening here. What I am saying is that I'm not going to waste my time reading something I can't easily vet for quality.

Author (not OP) has written plenty of its* own words on Bluesky over the last few weeks. If it's written anything longhand about this stuff I'd be interested to read it. But for now "anti-slop bias" designed into the system has not reached the prose.

*respecting pronouns

deleted today at 7:10 PM

sjbr today at 5:31 PM

Nice. Can it work with something like https://github.com/github/spec-kit ?

jatins today at 8:19 PM

The gist is 100% AI written https://www.pangram.com/history/9d89ebba-cdba-40e1-b569-9ae1...

rsrsrs86 today at 9:04 PM

That’s how I do it, minus the marketing.

beders today at 7:18 PM

> Define the contract before writing a single line of implementation. Specs are the source of truth.

There is only one source of truth and that is the source code. To define and change contracts written in an ambiguous language and then hope the right code will magically appear is completely delusional.

Iteration is the only game in town that is fast and produces results.

mitchbob today at 6:36 PM

Upvoted for the Sarcasmotron.

politician today at 5:31 PM

This is a decent approach. My concern with TDD is that writing tests necessarily implies designing an API upon which those tests operate. Here, the agent is instructed to "not write code, write tests", and yet, in doing so it defines an API. This will cause the AI to hallucinate the API. Layering in yet more tests on top of this will cause that API to deform in strange ways that pass tests but that the adversary will not be able to cope with because it runs too late in the VSDD process.

I've seen this exact process play out in my own work. The AI generates code and tests that pass with high code coverage and honor the invariants set by the spec. I look at the code and find a rat's nest / ball of mud that will cost 10x more tokens to enhance should I ever need to add a feature.

So, I think you're on to something, but I think the process might be discounting extensibility and resilience under change.

desireco42 today at 6:21 PM

Claude or something different... there is life beyond Claude I assure you and it is quite good and colourful.

esafak today at 6:50 PM

This is AI slop not worth my time. What would be interesting is if the author shared her practical experience in implementing it. Let's see some of those specs. What tricky bugs did it catch? The author's latest repo hasn't even been passing CI, so what does that say? https://github.com/dollspace-gay/Tesseract-Vault/commits/mai...

galoisscobi today at 6:08 PM

I think this word salad doesn’t have enough buzzwords. Throw in a few more acronyms too.