Software factories and the agentic moment

120 points - today at 3:05 PM

Alex_L_Wood today at 9:18 PM
>If you haven't spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement

…What am I even reading? Am I crazy to think this is a crazy thing to say, or is it actually crazy?

noosphr today at 6:18 PM
I was looking for some code, or a product they made, or anything really on their site.

The only github I could find is: https://github.com/strongdm/attractor

    Building Attractor

    Supply the following prompt to a modern coding agent
    (Claude Code, Codex, OpenCode, Amp, Cursor, etc):
  
    codeagent> Implement Attractor as described by
    https://factory.strongdm.ai/

Canadian girlfriend coding is now a business model.

Edit:

I did find some code. Commit history has been squashed unfortunately: https://github.com/strongdm/cxdb

There's a bunch more under the same org but it's years old.

CuriouslyC today at 5:56 PM
Until we solve the validation problem, none of this stuff is going to be more than flexes. We can automate code review, set up analytic guardrails, etc, so that looking at the code isn't important, and people have been doing that for >6 months now. You still have to have a human who knows the system to validate that the thing that was built matches the intent of the spec.

There are higher- and lower-leverage ways to do that, for instance reviewing tests and QA'ing the software by using it rather than reading the original code, but you can't get away from doing it entirely.

codingdave today at 5:21 PM
> If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement

At that point, outside of FAANG and their salaries, you are spending more on AI than you are on your humans. And they consider that level of spend to be a metric in and of itself. I'm kinda shocked the rest of the article just glossed over that one. It seems like a breaking point for the entire vision of AI-driven coding. I mean, sure, the vendors would love it if everyone's salary budget just got shifted over to their revenue, but such a world is absolutely not my goal.

insuranceguru today at 9:32 PM
the agentic shift is where the legal and insurance worlds are really going to struggle. we know how to model human error, but modeling an autonomous loop that makes a chain of small decisions leading to a systemic failure is a whole different beast. the audit trail requirements for these factories are going to be a regulatory nightmare.

amarant today at 6:34 PM
"If you haven't spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement"

Apart from being an absolutely ridiculous metric, this is a bad approach, at least with current-generation models. In my experience, the less you inspect what the model does, the more spaghetti-like the code will be. And the flying spaghetti monster eats tokens faster than you can blink! Or put more clearly: implementing a feature will cost you a lot more tokens in a messy code base than it does in a clean one. It's not (yet) enough to just tell the agent to refactor and make it clean; you have to give it hints on how to organise the code.

I'd go so far as to say that if you're burning a thousand dollars a day per engineer, you're getting very little bang for your tokens.

And your engineers probably look like this: https://share.google/H5BFJ6guF4UhvXMQ7

simonw today at 5:01 PM
This is the stealth team I hinted at in a comment on here last week about the "Dark Factory" pattern of AI-assisted software engineering: https://news.ycombinator.com/item?id=46739117#46801848

I wrote a bunch more about that this morning: https://simonwillison.net/2026/Feb/7/software-factory/

This one is worth paying attention to. They're the most ambitious team I've seen exploring the limits of what you can do with this stuff. It's eye-opening.

lubujackson today at 10:56 PM
I explored the different mental frameworks for how we use LLMs here: https://yagmin.com/blog/llms-arent-tools/

I think the "software factory" is currently the end state of using LLMs in most people's minds, but I think there is (at least) one more level: LLMs as applications.

Which is more or less creating a customized harness. There is a lot more that is possible once we move past the idea that harnesses are just for workflow variations for engineers.

japhyr today at 5:13 PM
> That idea of treating scenarios as holdout sets—used to evaluate the software but not stored where the coding agents can see them—is fascinating. It imitates aggressive testing by an external QA team—an expensive but highly effective way of ensuring quality in traditional software.

This is one of the clearest takes I've seen that starts to get me to the point of possibly being able to trust code that I haven't reviewed.

The whole idea of letting an AI write tests was problematic because they're so focused on "success" that `assert True` becomes appealing. But orchestrating teams of agents that are incentivized to build, and teams of agents that are incentivized to find bugs and problematic tests, is fascinating.
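
If I'm reading the pattern right, the mechanics don't have to be exotic: keep the scenarios somewhere the coding agents can't read, then have a separate harness run them against whatever the agents built and report a fraction rather than a single pass/fail bit. Here's a rough sketch of what I imagine that harness looking like; the scenario format, environment variables, and substring check are my own assumptions, not anything from the article:

    // holdout_runner.go (illustrative): run scenarios kept outside the
    // agents' workspace against the binary they produced, report a fraction.
    package main

    import (
        "encoding/json"
        "fmt"
        "os"
        "os/exec"
        "path/filepath"
        "strings"
    )

    // Scenario is a hypothetical format: feed Input to the program's stdin
    // and check that the combined output contains WantSubstring.
    type Scenario struct {
        Name          string `json:"name"`
        Input         string `json:"input"`
        WantSubstring string `json:"want_substring"`
    }

    func main() {
        holdoutDir := os.Getenv("HOLDOUT_DIR") // not part of the agents' checkout
        binary := os.Getenv("TARGET_BINARY")   // the artifact the agents built

        files, _ := filepath.Glob(filepath.Join(holdoutDir, "*.json"))
        if binary == "" || len(files) == 0 {
            fmt.Fprintln(os.Stderr, "set TARGET_BINARY and HOLDOUT_DIR")
            os.Exit(1)
        }

        passed := 0
        for _, f := range files {
            data, err := os.ReadFile(f)
            if err != nil {
                continue
            }
            var s Scenario
            if json.Unmarshal(data, &s) != nil {
                continue
            }
            cmd := exec.Command(binary)
            cmd.Stdin = strings.NewReader(s.Input)
            out, _ := cmd.CombinedOutput()
            if strings.Contains(string(out), s.WantSubstring) {
                passed++
            } else {
                fmt.Printf("FAIL %s\n", s.Name)
            }
        }
        // A satisfaction-style fraction instead of a single green/red bit.
        fmt.Printf("satisfaction: %d/%d\n", passed, len(files))
    }

The point isn't the specifics, it's that the evaluation lives outside the loop the builders can game.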

I'm quite curious to see where this goes, and more motivated (and curious) than ever to start setting up my own agents.

Question for people who are already doing this: How much are you spending on tokens?

That line about spending $1,000 a day on tokens is pretty off-putting. For commercial teams it's an easy calculation. It's also depressing to think about what this means for open source. I sure can't afford to spend $1,000 a day supporting teams of agents to continue my open source work.

Dumblydorr today at 10:38 PM
What would happen if these agents were given a token lifespan and told to continually spend tokens to create agentic children, passing on their genetic and data makeup, such as it is, to children created with other agents (potentially sexually), while tokens are limited and they can't get enough without certain traits?

Wouldn't they start to evolve to reproduce more and eat more tokens? And then, as mature agents, take on further human prompts to gain more tokens?

Would you see certain evolutionary strategies re-emerge, like carnivores eating weaker agents for tokens or detritivores feeding on old code, or would it be more like the evolution of roles in a company?

I assume the main hurdle would be agents reproducing? How would that be implemented?

rileymichael today at 6:07 PM
> In rule form:
> - Code must not be written by humans
> - Code must not be reviewed by humans

as a previous strongDM customer, i will never recommend their offering again. for a core security product, this is not the flex they think it is

also mimicking other products' behavior and staying in sync is a fool's task. you certainly won't be able to do it just off the API documentation. you may get close, but never perfect, and you're going to experience constant breakage

kykat today at 10:01 PM
I'm just going to say: when opening the "twins" (bad clones) screenshots, I pressed the right arrow key to view the next image, and surprise, the next "article" in the top navigation bar was loaded instead of showing the next image.

Is this the quality we should expect from agentic coding? From my experiments with Claude Code, yes, the UX details are never there, especially for bigger features. It can work reasonably well independently up to a "module" level (with clear interfaces). But for full app design, while technically possible, the UX and visual design just aren't there.

And I am very much not attracted to the idea of polishing such agentic apps. A solution could be: 1. The boss prompts the system with what he wants. 2. The boss outsources the task of polishing the rough edges to India.

===

More on the arrow-key navigation: pressing right on the last "Products" page loops to the first "Story" page, yet pressing left on the first page does nothing. Typical UX inconsistency of vibe-coded software.

swisniewski today at 10:25 PM
Some of this is people trying to predict the future.

And it’s not unreasonable to assume it’s going there.

That being said, the models are not there yet. If you care about quality, you still need humans in the loop.

Even when given high quality specs, and existing code to use as an example, and lots of parallelism and orchestration, the models still make a lot of mistakes.

There’s lots of room for Software Factories, and Orchestrators, and multi agent swarms.

But today you still need humans reviewing code before you merge to main.

Models are getting better, quickly, but I think it’s going to be a while before “don’t have humans look at the code” is true.

politelemon today at 6:30 PM
> we transitioned from boolean definitions of success ("the test suite is green") to a probabilistic and empirical one. We use the term satisfaction to quantify this validation: of all the observed trajectories through all the scenarios, what fraction of them likely satisfy the user?

Oh, to have the luxury of redefining success and handwaving away hard-learned lessons in the software industry.

galoisscobi today at 6:54 PM
What has strongdm actually built? Are their users finding value from their supposed productivity gains?

If their focus is only on showing off their productivity/AI system without having built anything meaningful with it, it feels like one of those scammy life coaches/productivity gurus who talk about how they got rich by selling their courses.

Herring today at 5:55 PM
$100 says they're still doing leetcode interviews.

If everyone can do this, there won't be any advantage (or profit) to be had from it very soon. Why not buy your own hardware and run local models, I wonder.

d0liver today at 5:46 PM
> As I understood it the trick was effectively to dump the full public API documentation of one of those services into their agent harness and have it build an imitation of that API, as a self-contained Go binary. They could then have it build a simplified UI over the top to help complete the simulation.

This is still the same problem -- just pushed back a layer. Since the generated API is wrong, the QA outcomes will be wrong, too. Also, QAing things is an effective way to ensure that they work _after_ they've been reviewed by an engineer. A QA tester is not going to test for a vulnerability like a SQL injection unless they're guided by engineering judgement which comes from an understanding of the properties of the code under test.

The output is also essentially the definition of a derivative work, so it's probably not legally defensible (not that that's ever been a concern with LLMs).

hnthrow0287345 today at 5:58 PM
Yep, you definitely want to be in the business of selling shovels for the gold rush.

wrs today at 5:38 PM
On the cxdb “product” page one reason they give against rolling your own is that it would be “months of work”. Slipped into an archaic off-brand mindset there, no?

simianwords today at 6:07 PM
I like the idea but I'm not so sure this problem can be solved generally.

As an example: imagine someone writing a data pipeline for training a machine learning model. Anyone who's done this knows that such a task involves lots of data-wrangling work, like cleaning data, changing columns, and some ad hoc stuff.

The only way to verify that things work is if the eventual model that is trained performs well.

In this case, scenario testing doesn't scale because the feedback loop is extremely long: you have to wait until the model is trained and tested on held-out data.

Scenario testing clearly cannot work on the smaller parts of the work, like data wrangling.

mccoyb today at 6:55 PM
Effectively everyone is building the same tools with zero quantitative benchmarks or evidence behind the why / the ideas … this entire space is a nightmare to navigate because of it. Who cares, without proper science, seriously? I look through this website and it looks like a preview for a course I'm supposed to buy … when someone builds something with these sorts of claims attached, I expect there to be some "real graphs" ("this is the number of times this model deviated from the spec before we added error correction …")

What we have instead are many people creating hierarchies of concepts, a vast “naming” of their own experiences, without rigorous quantitative evaluation.

I may be alone in this, but it drives me nuts.

Okay, so with that in mind, it amounts to hearsay ("these guys are doing something cool"). Why not put up or shut up: either (a) evaluate the ideas in a rigorous, quantitative way, or (b) apply the ideas to produce a "hard" artifact (analogous, e.g., to the Anthropic C compiler or the Cursor browser) with a reproducible pathway to generation?

The answer seems to be that (b) is impossible (as long as we're on the teat of the frontier labs, which disallow the kind of access that would make (b) possible), and the answer for (a) is "we can't wait, we have to get our names out there first".

I’m disappointed to see these types of posts on HN. Where is the science?

mellosouls today at 3:07 PM
Having submitted this I would also suggest the website admin revisit their testing; it's very slow on my phone. Obviously fails on aesthetics and accessibility as well. Submitted for the essay.

eclipsetheworld today at 6:00 PM
I have been working on my own "Digital Twins Universe" because 3rd-party SaaS tools often block the tight feedback loops required for long-horizon agentic coding. Unlike Stripe, which offers a full-featured environment usable in both development and staging, most B2B SaaS companies lack adequate fidelity (e.g., missing webhooks in local dev) or even a basic staging environment.

Taking the time to point a coding agent towards the public (or even private) API of a B2B SaaS app to generate a working (partial) clone is effectively "unblocking" the agent. I wouldn't be surprised if a "DTU-hub" eventually gains traction for publishing and sharing these digital twins.
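
To make that concrete, a stripped-down twin is usually just a small, self-contained stub server with in-memory state. A minimal sketch along those lines; the resource, fields, and port are invented for illustration and aren't any particular vendor's API:

    // twin_stub.go (illustrative): a tiny in-memory imitation of a
    // hypothetical issue-tracker API, just enough surface to test against.
    package main

    import (
        "encoding/json"
        "log"
        "net/http"
        "sync"
    )

    type Issue struct {
        ID    int    `json:"id"`
        Title string `json:"title"`
        State string `json:"state"`
    }

    type store struct {
        mu     sync.Mutex
        nextID int
        issues map[int]Issue
    }

    func main() {
        s := &store{nextID: 1, issues: map[int]Issue{}}

        http.HandleFunc("/api/issues", func(w http.ResponseWriter, r *http.Request) {
            s.mu.Lock()
            defer s.mu.Unlock()
            switch r.Method {
            case http.MethodPost: // create, mimicking the real API's response shape
                var in Issue
                if err := json.NewDecoder(r.Body).Decode(&in); err != nil {
                    http.Error(w, err.Error(), http.StatusBadRequest)
                    return
                }
                in.ID, in.State = s.nextID, "open"
                s.issues[in.ID] = in
                s.nextID++
                w.WriteHeader(http.StatusCreated)
                json.NewEncoder(w).Encode(in)
            case http.MethodGet: // list everything; pagination omitted for brevity
                out := make([]Issue, 0, len(s.issues))
                for _, issue := range s.issues {
                    out = append(out, issue)
                }
                json.NewEncoder(w).Encode(out)
            default:
                http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
            }
        })

        log.Fatal(http.ListenAndServe(":8080", nil))
    }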

I would love to hear more about your learnings from building these digital twins. How do you handle API drift? Also, how do you handle statefulness within the twins? Do you test for divergence? For example, do you compare responses from the live third-party service against the Digital Twin to check for parity?
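
For that last question, what I have in mind is roughly this: replay the same read-only request against the live service and the twin, then diff the decoded responses. A sketch of that idea; the endpoint, env-var names, and the naive DeepEqual comparison are placeholders, not a real parity suite:

    // parity_check.go (illustrative): replay the same GET against the live
    // API and the twin, then compare the decoded JSON for divergence.
    package main

    import (
        "encoding/json"
        "fmt"
        "io"
        "net/http"
        "os"
        "reflect"
    )

    func fetchJSON(base, path string) (any, error) {
        resp, err := http.Get(base + path)
        if err != nil {
            return nil, err
        }
        defer resp.Body.Close()
        body, err := io.ReadAll(resp.Body)
        if err != nil {
            return nil, err
        }
        var v any
        if err := json.Unmarshal(body, &v); err != nil {
            return nil, err
        }
        return v, nil
    }

    func main() {
        live := os.Getenv("LIVE_BASE_URL") // the real vendor API
        twin := os.Getenv("TWIN_BASE_URL") // e.g. http://localhost:8080
        path := "/api/issues"              // hypothetical endpoint both expose

        a, errA := fetchJSON(live, path)
        b, errB := fetchJSON(twin, path)
        if errA != nil || errB != nil {
            fmt.Fprintln(os.Stderr, "fetch failed:", errA, errB)
            os.Exit(1)
        }
        // A real check would strip volatile fields (ids, timestamps) first.
        if reflect.DeepEqual(a, b) {
            fmt.Println("parity: responses match for", path)
        } else {
            fmt.Println("divergence detected for", path)
        }
    }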

easeout today at 4:58 PM
> A problem repeatedly occurred on "https://factory.strongdm.ai/".

neya today at 6:28 PM
The solution to this problem is not throwing everything at AI. To get good results from any AI model, you need an architect (human) instructing it from the top. The logic behind this is that AI has been trained on millions of opinions on getting a particular task done. If you ask a human, they almost always have one opinionated approach for a given task. The human's opinion is a derivative of their lived experience, sometimes foreseeing all the way to an end result an AI cannot foresee. E.g., I want a database column to be a certain type because I'm thinking about adding an e-commerce feature to my CMS later. An AI might not have this insight.

Of course, you can't always tell the model what to do, especially if it is a repeated task. It turns out we already solved this decades ago using algorithms: repeatable, reproducible, reliable. The challenge (and the reward) lies in separating the problem statement into algorithmic and agentic parts. Once you achieve this, the $1,000 of token usage is not needed at all.

I have a working prototype of the above and I'm currently productizing it (shameless plug):

https://designflo.ai

However, I need to emphasize that the language you use to apply the pattern above matters. I use Elixir specifically for this, and it works really, really well.

It works by starting with the architect: you. It feeds off specs and uses algorithms as much as possible to automate code generation (e.g., scaffolding), and only uses AI sparingly, when needed.
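
To illustrate the split (not my actual implementation, and in Go rather than Elixir just to keep the sketch self-contained): the scaffolding step is a deterministic template over the architect's spec, and the single llmComplete stub below marks the only place an agent would be consulted. The spec shape and function names are invented for illustration:

    // scaffold.go (illustrative): deterministic template expansion for the
    // predictable parts, with a stubbed llmComplete call marking where the
    // agent would be consulted.
    package main

    import (
        "fmt"
        "os"
        "strings"
        "text/template"
    )

    // Spec is a hypothetical architect-authored description of one resource.
    type Spec struct {
        Resource string   // e.g. "Photo"
        Fields   []string // e.g. ["title", "url", "likes_count"]
    }

    // schemaTmpl is the algorithmic part: same spec in, same scaffolding out.
    var schemaTmpl = template.Must(template.New("schema").Parse(
        "CREATE TABLE {{.Resource}}s (\n  id SERIAL PRIMARY KEY" +
            "{{range .Fields}},\n  {{.}} TEXT{{end}}\n);\n"))

    // llmComplete stands in for the sparse agentic step (naming, edge cases,
    // copy). In a real system this would be the only LLM call.
    func llmComplete(prompt string) string {
        return "-- TODO (agent): " + prompt
    }

    func main() {
        spec := Spec{Resource: "Photo", Fields: []string{"title", "url", "likes_count"}}

        var scaffold strings.Builder
        if err := schemaTmpl.Execute(&scaffold, spec); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }

        // Deterministic scaffolding first; AI only for what templates can't express.
        fmt.Print(scaffold.String())
        fmt.Println(llmComplete("validation rules for " + spec.Resource))
    }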

Of course, the downside of this approach is that you can't simply say "build me a social network". You can, however, say something like "Build me a social network where users can share photos, repost, like and comment on them".

Once you nail the models used in the MVC pattern and their relationships, the software-design battle is pretty much 50% won. This is really good for v1 prototypes where you really want best practices enforced and OWASP-compliant, security-first output, which is where a pure agentic/AI approach would mess up.

stego-tech today at 6:33 PM
IT perspective here. Simon hits the nail on the head as to what I'm genuinely looking forward to:

> How do you clone the important parts of Okta, Jira, Slack and more? With coding agents!

This is what's going to gut-punch most SaaS companies repeatedly over the next decade, even if this whole build-out ultimately collapses in on itself (which I expect it to). The era of bespoke consultants for SaaS product suites handling configuration and integrations, while not gone, is certainly under threat from LLMs that can ingest user requirements and produce functional code that does a similar thing at a fraction of the price.

What a lot of folks miss is that in enterprise-land, we only need the integration once. Once we have an integration, it basically exists with minimal if any changes until one side of the integration dies. Code fails a security audit? We can either spool up the agents again briefly to fix it, or just isolate it in a security domain like the glut of WinXP and Win7 boxen rotting out there on assembly lines and factory floors.

This is why SaaS stocks have been hammered this week. It's not that investors genuinely expect huge players to go bankrupt due to AI so much as they know the era of infinite growth is over. It's also why big AI companies are rushing IPOs even as data center builds stall: we're officially in a world where a locally-run model - not even an Agent, just a model in LM Studio on the Corporate Laptop - can produce sufficient code for a growing number of product integrations without any engineer having to look through yet another set of API documentation. As agentic orchestration trickles down to homelabs and private servers on smaller, leaner, and more efficient hardware, that capability is only going to increase, threatening profits of subscription models and large AI companies. Again, why bother ponying up for a recurring subscription after the work is completed?

For full-fledged software, there's genuine benefit to be had with human intervention and creativity; for the multitude of integrations and pipelines that were previously farmed out to pricey consultants, LLMs will more than suffice for all but the biggest or most complex situations.

navanchauhan today at 5:08 PM
(I’m one of the people on this team). I joined fresh out of college, and it’s been a wild ride.

I’m happy to answer any questions!

svilen_dobrev today at 8:22 PM
how about the elephant in the room: apart from the business spec itself, where are all those (supply-chain) API specs and docs going to come from, once the API makers themselves have gone through, say, 3 iterations in this vein?

CubsFan1060 today at 5:29 PM
I can't tell if this is genius or terrifying given what their software does. Probably a bit of both.

I wonder what the security teams at companies that use StrongDM will think about this.

g947o today at 5:29 PM
Serious question: what's keeping a competitor from doing the same thing and doing it better than you?

rhrthg today at 5:09 PM
Can you disclose the number of Substack subscriptions, and whether there is an unusual number of bulk subscriptions from certain entities?

srcreigh today at 6:22 PM
This is just sleight of hand.

In this model the spec/scenarios are the code. These are curated and managed by humans just like code.

They say "non interactive". But of course their work is interactive. AI agents take a few minutes-hours whereas you can see code change result in seconds. That doesn't mean AI agents aren't interactive.

I'm very AI-positive, and what they're doing is different, but they are basically just lying. It's a new word for a new instance of the same old type of thing. It's not a new type of thing.

The common anti-AI trope is "AI just looked at <human output> to do this." The common AI trope from StrongDM is "look, the agent is working without human input." Both of these takes are fundamentally flawed.

AI will always depend on humans to produce relevant results for humans. It's not a flaw of AI, it's more of a flaw of humans. Consequently, "AI needs human input to produce results we want to see" should not detract from the intelligence of AI.

Why is this true? At a certain point you just run into Kolmogorov complexity: AI has fixed memory and a fixed prompt size, so by the pigeonhole principle not every output can be produced, even over all possible inputs, given specific model weights.

Recursive self-improvement doesn't get around this problem. Where does it get the data for the next iteration? From interactions with humans.

Given the infinite complexity of mathematics, for instance solving Busy Beaver numbers, this is a proof that AI cannot in fact solve every problem. Humans seem to be limited in this regard as well, but there is no proof that humans are fundamentally limited this way the way AI is. This lack of proof of human limitations is the precise advantage in intelligence that humans will always have over AI.

dist-epoch today at 6:54 PM
Gas Town, but make it Enterprise.

threecheese today at 5:54 PM
So much of this resonated with me, and I realize I’ve arrived at a few of the techniques myself (and with my team) over the last several months.

THIS FRIGHTENS ME. Many of us swengs are either going to be FIRE millionaires, or living under a bridge, in two years.

I’ve spent this week performing SemPort; I found a TypeScript app that does a needed thing, and was able to use a long chain of prompts to get it completely reimplemented in our stack, using Gene Transfer to ensure it uses some existing libraries and concrete techniques present in our existing apps.

Now not only do I have an idiomatic Python port, which I can drop right into our stack, but I also have an extremely detailed features/requirements statement for the original TypeScript app, along with the prompts for generating it. I can use this to continuously track the other product as it improves. I also have the “instructions infrastructure” to direct an agent to align new code to our stack. Two reusable skills, a new product, and it took a week.

layer8 today at 6:19 PM
So, what does DM stand for?

AlexeyBrin today at 6:26 PM

    Code must not be written by humans
    Code must not be reviewed by humans

I feel like I'm taking crazy pills. I would avoid this company like the plague.