Agents need control flow, not more prompts

485 points - yesterday at 4:43 PM

827a yesterday at 8:36 PM
1000% agree. I am increasingly hesitant to believe Anthropic's continual war drum of "build for the capabilities of future models, they'll get better".

We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really helped improve our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: letting the model manage the high-level control flow.

This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it that it needed to re-test four previous files, for no reason. It was very frustrating. We quickly discovered during testing that there was no consistency to its (Opus 4.6 and GPT 5.4, IIRC) ability to actually orchestrate the workflow. Sometimes it would work, sometimes it wouldn't. I've also tested it once or twice against Opus 4.7 and GPT 5.5, not as extensively, but it seems to have the same problems.

We ended up creating a super basic deterministic harness around the model: for each test case, trigger the model to test that test case, store the results in an array, write the results to a file. This has made the system a billion times more reliable. But it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc.) because they're all so gigapilled on "the agent has to run everything" that they can't see how valuable these systems can be if you just add a wee bit of determinism in the right place.
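
Roughly this shape, as a minimal sketch (run_agent_on() here is a made-up stand-in for whatever actually triggers the model on one file):

    import json
    from pathlib import Path

    def run_agent_on(requirement_text: str) -> dict:
        # Hypothetical stand-in for the model call that tests one
        # requirements file in a browser session.
        return {"passed": True, "notes": "stub"}

    results = []
    for req_file in sorted(Path("requirements").glob("*.md")):
        outcome = run_agent_on(req_file.read_text())
        results.append({"file": req_file.name, **outcome})

    # The harness, not the model, guarantees every file is visited
    # exactly once and the results land on disk.
    Path("results.json").write_text(json.dumps(results, indent=2))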

beshrkayali today at 11:23 AM
Humble mention, I’ve been thinking the same thing with Ossature for the last couple of months since I started working on it: https://ossature.dev

The models are already good enough for code generation. What we need is the harness around them deterministically enforcing a specific path and "leashing" the model's output so it stays aligned with the user's intention as much as possible. You can't make the output of the model deterministic, but you can make everything around it deterministic.

Trying to make enforcement work with prompts is like a government agency investigating or auditing itself: there's no incentive to find problems, so you'll inevitably get the "All Good, Boss!"

rnxrx yesterday at 6:10 PM
I wonder if a part of the problem isn't just the misapplication of LLMs in the first place. As has been mentioned elsewhere, perhaps the agent's prompt should be to write code to accomplish as much of the task in as repeatable/verifiable/deterministic a way as possible. This would hopefully include validation of the agent's output as well. The overall goal would be to keep the LLM out of doing processing that could be more efficiently (and often correctly) handled programmatically.
bwestergard yesterday at 5:09 PM
I agree with the sentiment, but I think the conclusion should be altered. When you hit the limit of prompting, you need to move from using LLMs at run time to accomplish a task to using LLMs to write software to accomplish the task. The role of LLMs at run time will generally shrink to helping users choose compliant inputs to a software system that embodies hard business rules.
jerf yesterday at 5:39 PM
This is why I frequently refer to "next generation AIs" that aren't just LLMs. LLMs are pretty cool, and I expect that even if we see no further foundational advancement in AIs, we're going to continue to see them exploited in more interesting ways and optimized better. Even if the models froze as they are today, there's a lot more value to be squeezed out of them as we figure out how to do that.

However, there are some things that I think need a foundational, next-generation improvement of some sort. The way that LLMs sort of smudge away "NEVER DO X", and can even, after a lot of work, end up treating it as a bit of a "PLEASE DO X", seems fundamental to how they work. It can be easy to lose track of as we are still in the initial flush of figuring out what they can do (despite all we've already found), but LLMs are not everything we're looking for out of AI.

There should be some sort of architecture that can take a "NEVER DO X" and treat it the way a human would. There should be some sort of architecture that, instead of having a "context window", has memory hierarchies something like ours, where if two people have sufficiently extended conversations with what was initially the same AI, the resulting two AIs are different not just in their context windows but have actually become two individuals.

I of course have no more idea what this looks like than anyone else. But I don't see any reason to think LLMs are the last word in AI.

gck1 yesterday at 10:00 PM
As someone who went full circle prompt-enforcement > deterministic flow > prompt-enforcement, I disagree.

The reason "DO NOT SKIP" fails is that your agent is responsible for too many things, and there are things in context pulling attention away from this guidance.

But nobody said the agent that does enforcement must be the same agent that builds. While you can likely encode some smart decision-making logic in your deterministic control flow, you either make it too rigid to work well, or you make it so complex that at that point you might as well just use the agent; it will be cheaper to set up and maintain.

You essentially need 3 base agents:

- Supervisor that manages the loop and kicks the right things into gear if things break down

- Orchestrator that delegates things to appropriate agents and enforces guardrails where appropriate

- Workers that execute units of work. These may take many shapes.
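
A minimal sketch of that split, with each role reduced to a function; the worker body is a stand-in for a real model call, and all names are illustrative:

    def worker(task: str) -> str:
        # Stand-in for one unit of model-driven work.
        return f"result for {task}"

    def orchestrator(tasks: list[str]) -> list[str]:
        results = []
        for task in tasks:
            result = worker(task)
            if "forbidden" in result:  # guardrail enforced in code
                raise ValueError(f"guardrail tripped on {task!r}")
            results.append(result)
        return results

    def supervisor(tasks: list[str], max_retries: int = 2) -> list[str]:
        for _ in range(max_retries + 1):
            try:
                return orchestrator(tasks)  # kick things back into gear
            except ValueError:
                continue  # restart the loop if things break down
        raise RuntimeError("supervisor gave up")

    print(supervisor(["check login", "check signup"]))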

cloaky233 today at 10:43 AM
It's not that agents don't need more prompts; breaking the prompt into a combination of a dynamically changing prompt and a static prompt actually resolves most of the issues. Control flow, on the other hand, is harnessing plus context building, which is one major part of agentic workflows. So I believe an "optimized" combination of both is what we should be looking for.
throawayonthe today at 10:41 AM
i gave in and bought a month of claude (it really is a slot machine, don't do it if you have an addictive personality lol) to vibecode a bit, and the Superpowers skill set is cool and all, but it really seems like something that should be turned into a program

hmmmmmm maybe i could vibecode a harness based on that pi thing i've heard about, and integrate it closer with jj instead of relying on llms knowing how to use it, and make certain stages guaranteed to run... oh dear

edit: also i can't bring myself to believe the 'ultimate' form or whatever stabilizes out will be chat-based interfaces for coding and code generation

i think it's just that openai happened to strike gold with ChatGPT and nobody has time to figure anything else out because they've got to get the bazillion investor dollars with something that happens to kinda work

also afaiu all these instruct models are based on 'base' models that 'just' do text prediction, without replying in a chat format; will we see code generation models that output just code without the chat stuff?

JohnMakin yesterday at 6:38 PM
> Imagine a programming language where statements are suggestions and functions return “Success” while hallucinating. Reasoning becomes impossible; reliability collapses as complexity grows.

This is essentially declarative programming. Most traditional programming is imperative, which is what most developers are used to: I give an exact set of instructions and expect them to be executed as written. Agents are far more declarative than imperative: you give them a result, and they work on getting that result. The problem, of course, is that in something declarative like SQL, the result is going to be pretty consistent and well-defined, but you're still trusting the underlying engine on how to go about it.

Thinking about agents declaratively has helped me a lot, rather than trying to design these Rube Goldberg "control" systems around them. Didn't get it right? OK, I validated that it's not correct; let's try again or approach it differently.

If you really need something imperative, then write something imperative! Or have the agent do so. This stuff reads like trying to use the wrong tool for the job.

dkersten today at 8:52 AM
This is something I realised late last year while using Claude Code. The LLM shouldn't be the one in control of the workflow, because the LLM can make mistakes, skip steps, hallucinate steps, etc. It's also wasteful of tokens.

I'm a firm believer that a "thin harness" is the wrong approach for this reason, and that workflows should be enforced in code. Doing that lets you make sure the workflow is always followed, and it reduces tokens, since the LLM no longer has to consider the workflow or read the workflow instructions. But it also allows more interesting things: you can split plans into steps and feed them through a workflow one by one (so the model no longer needs such strong multi-step instruction following); you can give each workflow stage its own context or prompts; you can add workflow-stage-specific verification.

Based on my experience with Claude Code and Kilo Code, I've been building a workflow engine for this exact purpose: it lets you define sequences, branches, and loops in a configuration file that it then steps through. I've opted to pass JSON data between stages and use the `jq` language for logic and data extraction. The engine itself is written in hand-coded Rust (the recent Claude Code bugs taught me that the core has to be solid), while the actual LLM calls are done in a subprocess (currently my own TypeScript + Vercel AI SDK harness, but the plan is to support third-party ones like the Claude Code CLI, Codex CLI, etc. too, in order to be able to use their subscriptions).

I'm not quite ready to share it just yet, but I thought it was interesting to mention since it aims to solve the exact problem that OP is talking about.
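
Roughly the shape of it, as a toy sketch rather than the real engine (stage names and checks are made up, and call_llm() stands in for the subprocess):

    import json

    STAGES = [  # stands in for the configuration file
        {"name": "plan",      "check": lambda out: "steps" in out},
        {"name": "implement", "check": lambda out: "diff" in out},
        {"name": "verify",    "check": lambda out: out.get("ok") is True},
    ]

    def call_llm(stage: str, payload: dict) -> dict:
        # Stand-in for the subprocess that actually talks to a model.
        return {"steps": [], "diff": "", "ok": True}

    payload = {"task": "example"}
    for stage in STAGES:
        out = call_llm(stage["name"], payload)
        if not stage["check"](out):  # stage-specific verification, in code
            raise RuntimeError(f"stage {stage['name']} failed verification")
        payload = {**payload, stage["name"]: out}  # JSON handed to next stage
    print(json.dumps(payload, indent=2))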

isityettime yesterday at 7:10 PM
Afaict all harnesses are wrong in this respect, some of them deeply so.

Slash commands, for instance, are a misfeature. I should never have to wait for the chatbot to finish a turn just to check on the status of my context window or how much money I've spent this session. Control should be orthogonal to the chat loop.

Even things that have nothing to do with controlling the text generator's input and output are entangled with chat actions for no good reason except "it's a chat thing, let's pretend we're operating an IRC bot".

There are a zillion LLM agents out there nowadays, but none of them really separate control, the agent loop, and presentation well. (A few do at least have headless modes, which is cool.)

bandrami today at 6:15 AM
It's going to be hilarious in a few years when people are still using LLMs, but only via a controlled vocabulary and syntax that you have to learn. It's just like how everybody moved to NoSQL 15 years ago but immediately recreated schemas in their JSON.
plumbline today at 12:32 AM
I've been thinking about this a lot, actually. It can almost be related to the conversation about specialization: the more specialized a model is required to be, the less capable it seems to be at a foundational level, whereas if you just aim for a little bit of abstraction, you might get the best of both worlds.

Here's a pretty specific example of what I mean, but maybe food for thought:

Podcast (20 minute digest): https://pub-6333550e348d4a5abe6f40ae47d2925c.r2.dev/EP008.ht...

Paper: https://arxiv.org/abs/2605.00225

Neywiny yesterday at 5:07 PM
If you're trying to get reliability and determinism out of the LLM, you've already lost
59nadir yesterday at 5:59 PM
This was one of the key insights in Stripe's explanation of Minions[0], their autonomous agent system: in between non-deterministic LLM work, they had deterministic nodes that handled quality assurance and the like, so those kinds of things weren't left to the LLMs.

0 - https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-...

Weryj today at 8:47 AM
Pure agentic loops, with markdown documents as the "program", are an incredible agentic workflow for experimenting with, developing, and testing your workflow idea.

The second it works, bake the workflow into the harness. Yesterday I was doing just that, and the whole agent loop disappeared because the process could be condensed into a one-shot request (plus one MorphLLM fast apply) built from careful context construction. (It was an autoresearcher.)

Imanari today at 10:20 AM
As with so many things, aider.chat was ahead of its time with its ability to create deterministic scripts.
k__ today at 6:36 AM
At my new job, I was assigned to improve processes with AI.

My first thought was: agents seem nice, but I think AI workflows are a better bet. However, I didn't really understand AI or agents in depth, and I felt like I was just "doing things the old way", and that removing flexibility from agents was a ridiculous idea.

After some research, I got the impression that I was right. A well-defined workflow and scope are just what's needed for AI. It's cheaper and more consistent. It probably even makes the whole thing run well on non-SOTA models.

moconnor yesterday at 9:46 PM
“Flow” moves agents through a YAML flowchart of prompts and decisions. It's working quite well for a couple of us at Tenstorrent; more to discover here, though:

https://github.com/yieldthought/flow

Happily, 5.5 is good at writing and using it.

rglover yesterday at 7:39 PM
> Babysitter: Keep a human in the loop to catch errors before they propagate.

This is the only way to guarantee AI usage doesn't burn you. Any automation beyond this is just theater, no matter how much that hurts to hear/undermines your business model.

A bird sings, a duck quacks. You don't expect the duck to start singing now, do you?

apalmer yesterday at 5:14 PM
Generally agree with this stance. Case in point: the breakthrough in AI coding was not that model intelligence increased so much as that a lot of the core process execution moved out of the LLM prompt and into the harness.
andai today at 3:36 AM
Yeah, you could also see this in 2023 with Auto-GPT. People were letting GPT "drive" when what they actually needed, in most cases, was about ten lines of Python (and maybe a few calls to an llm() function).

The alternative is running your ten lines of Python in the most expensive, slowest, least reliable way possible. (Sure is popular though)

For example, most people were using the agents for internet research. It would spin for hours, get distracted or forget what it was supposed to be doing.

Meanwhile, with `import duckduckgo` and `import llm` you can write ten lines that do the same thing in 20 seconds, actually run deterministically, and cost 50x less.

The current models are much better -- good enough that Auto-GPT is real now! -- but running poorly specified control flow in the most expensive way possible is still a bad idea.
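
Roughly the shape of those ten lines; search() and llm() here are stand-ins for whatever search client and model API you actually use:

    def search(query: str) -> list[str]:
        # Stand-in for a real search client.
        return ["snippet one", "snippet two"]

    def llm(prompt: str) -> str:
        # Stand-in for a real completion call.
        return "summary"

    def research(question: str) -> str:
        snippets = search(question)
        notes = [llm(f"Summarize wrt {question!r}: {s}") for s in snippets]
        # The control flow is plain Python; only the fuzzy parts hit the model.
        return llm(f"Combine into one answer to {question!r}: {notes}")

    print(research("what do agents need, control flow or prompts?"))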

Nizoss yesterday at 8:49 PM
If you’re interested in such deterministic scaffolding/control flow, check out Probity.

I created it to address this exact issue. It is a vendor-neutral, ESLint-style policy engine that currently supports Claude Code, Codex, and Copilot.

It uses the agent's hook payloads and session history to enforce the policies. It can be set up to block commits if a file has been modified since the checks were last run, disallow content or commands using string or regex matching, and enforce TDD without any extra reporter setup, and it works with any language.

Feedback welcome: https://github.com/nizos/probity

trolleski today at 9:11 AM
Maybe we could devise a language which would be like a natural language but have some pretty neat formal properties... Wait...
shivnathtathe today at 9:17 AM
Observability is the missing piece here. I built opensmith for exactly this reason: tracing agent control flow locally.
kenjackson yesterday at 7:28 PM
I feel like people forget that they're still allowed to program. You're still allowed to create workflows tying together LLMs and agents if you want. Almost all the tools and technology that existed before LLMs are still available to be used.
nickstinemates today at 3:35 AM
This is why we built swamp[1].

Swamp teaches your agent to build and execute repeatable workflows, makes all the data it produces searchable, and enables your team to collaborate.

We also build swamp and swamp club using swamp. You can see that process in the lab[2]. This combines all of the creativity of the LLM for the parts that matter, while providing deterministic outcomes for the parts you need to be deterministic.

1: https://swamp.club

2: https://swamp.club/lab

socketcluster yesterday at 11:05 PM
That's why I built https://saasufy.com/ as an agent tool for building data-driven realtime apps.

I started working on it piece by piece about 14 years ago. It was originally targeted at junior developers, to give them the necessary security and scalability guardrails while maintaining as much flexibility as possible. It's very flexible; most of Saasufy is itself built using Saasufy. Only the actual user service and orchestration is custom backend code.

Also, I designed it in a way that helps the user fast-track their learning of important concepts like authentication, access control, and schema validation.

It turns out that all of these things that junior devs need are exactly what LLMs need as well.

I tested it with Claude Code originally and got consistently great results. More recently, I tested with https://pi.dev with GPT 5.5 and it seemed to be on par.

sudosteph yesterday at 7:52 PM
This is a good discussion topic. A lot of people really seem to believe that if you word a prompt just so and throw a high-powered model at it, it will consistently work the way you want. And maybe as models progress that will be the case. But right now, that's not how I've seen real life work out.

Even skills are not a catch-all, because besides the supply-chain risk of using skills you pull from someone else, a lot of tasks require an assortment of skills.

I've accommodated this with my agent team (mostly Sonnets, FWIW) by developing what we call "operational reflexes". Basically, common tasks that require multiple domains of expertise are given a lockfile defining which skills are most relevant (even which fragment of a skill) and how in-depth / verbose each element needs to be, so the same task gets accomplished the same way, with minimal hallucinations or external sources.

A coordinator agent assigns the tasks, selects the relevant lockfile, and sends it along, or passes it to another agent with a different lockfile geared towards reviewing.

It's a bit of work, but this workflow dramatically increased the quality of the technical output I get from my agents, and this way I don't really need to write many prompts myself.

est today at 6:40 AM
I have a question: do LLMs follow these MANDATORY or DO NOT SKIP instructions because of pre-training, like the way people write comment paragraphs in the Reddit corpus, or is it just a post-training alignment habit?
tim-projects yesterday at 6:25 PM
This is exactly the problem I've been working on, and I see others are too. When you implement quality-control gates, everything works better. It solves so many of the basic problems LLMs create: saying code is finished when it isn't, skipping tests, introducing code regressions, basic code validation, etc.

I am finding that the better the quality gates are, the lower-quality LLM you can use for the same result (at a cost of time).
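
A minimal sketch of a gate runner; the commands are just examples, swap in your own project's checks:

    import subprocess

    GATES = [  # example commands; replace with your project's checks
        ("tests", ["pytest", "-q"]),
        ("lint",  ["ruff", "check", "."]),
        ("types", ["mypy", "."]),
    ]

    def quality_gates() -> bool:
        for name, cmd in GATES:
            proc = subprocess.run(cmd, capture_output=True, text=True)
            if proc.returncode != 0:
                # Feed the failure back instead of trusting the model's
                # claim that the work is finished.
                print(f"gate {name} failed:\n{proc.stdout}{proc.stderr}")
                return False
        return True

    if __name__ == "__main__":
        print("done" if quality_gates() else "not done")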

dirtbag__dad today at 12:47 AM
Build CLIs your agents call, that scaffold what you want and lint so the result actually achieves your intended design.

Markdown files are a good reference but they are a weak enforcement tool and go stale easily.

Avoid burying yourself in more skills docs you're not even writing yourself and probably never read; focus that effort on deterministic tooling instead. (Not that skills or prompts are bad; I agree a meta-skill that tells an agent which subagents to run, and in what order, is useful.)
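
A toy sketch of the idea using argparse; the layout is made up, the point is that scaffold and lint are deterministic code the agent merely invokes:

    import argparse
    from pathlib import Path

    def scaffold(name: str) -> None:
        # Create the intended structure directly; no prompt needed.
        pkg = Path(name)
        (pkg / "tests").mkdir(parents=True, exist_ok=True)
        (pkg / "__init__.py").touch()
        (pkg / "tests" / f"test_{name}.py").write_text(
            f"def test_{name}_imports():\n    import {name}\n"
        )

    def lint(name: str) -> int:
        # Deterministic check that the scaffold still matches the design.
        missing = [p for p in ("__init__.py", "tests")
                   if not (Path(name) / p).exists()]
        print("ok" if not missing else f"missing: {missing}")
        return 1 if missing else 0

    parser = argparse.ArgumentParser(description="scaffold-and-lint CLI")
    parser.add_argument("command", choices=["scaffold", "lint"])
    parser.add_argument("name")
    args = parser.parse_args()
    if args.command == "scaffold":
        scaffold(args.name)
    else:
        raise SystemExit(lint(args.name))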

rbren yesterday at 10:24 PM
If you're interested in driving coding agents with code, check out the OpenHands Software Agent SDK [1]

We need to define agents in code and drive them through semi-deterministic workflows. Kick subtasks off to agents where appropriate, but do things like gathering context and handling agent output deterministically.

This is a massive boost in accuracy, cost efficiency, AND speed. Stop using tokens to do the deterministic parts of the task!

[1] https://github.com/OpenHands/software-agent-sdk

alasano today at 1:42 AM
I'm building a robust runtime for this.

It's externally orchestrated and managed, not run by an agent in the loop.

The goal is to force LLMs to produce exactly what you want every time.

I will be open sourcing soon. You can use whatever harness or tools you already use, you just delegate the actual implementation to the engine.

https://engine.build

sbinnee today at 3:36 AM
I have been telling my team that 1,000 lines of instructions are doomed to fail, no matter how great a model's instruction-following capability is. I have been reviewing hundreds of lines of changes on a daily basis for about a month. I couldn't help but turn to prayer.
illwrks yesterday at 7:28 PM
I’ve been building a small ‘agent’ using copilot at work, partly a learning exercise as well as testing it in a small use case.

My personal opinion is that AI and agents are being misrepresented. The amount of setup, guidance, and testing required to create a smarter version of a form is insane.

At the moment my small test is:

- Compressed instructions (to fit within the 8k limit)

- 9 different types of policies to guide the agent (JSON)

- 3 actual documents outlining domain knowledge (JSON)

- 8 topics (hint harvesting, guide rails, and the pieces of information prepared as adaptive cards for the user)

- 3 tools (to allow for connectors)

The whole thing is as robust as I can make it but it still feels like a house of cards and I expect some random hiccup will cause a failure.

astra_omnia today at 12:29 AM
I think this also points to what needs to exist after the control-flow layer. Once an agent executes a bounded workflow, teams still need a reviewable object showing what authority/scope it had, what artifacts it touched, what validation ran, what evidence was retained, and what limitations remain. Logs are useful, but they are not the same thing as an action receipt.
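
As a rough sketch of what such a receipt could look like (the field names are illustrative, not any standard):

    from dataclasses import dataclass, field, asdict
    import json, time

    @dataclass
    class ActionReceipt:
        # One reviewable object per bounded workflow run.
        scope: str                                            # authority the agent had
        artifacts: list[str] = field(default_factory=list)    # what it touched
        validations: list[str] = field(default_factory=list)  # what checks ran
        evidence: list[str] = field(default_factory=list)     # what was retained
        limitations: str = ""                                 # what remains uncovered
        finished_at: float = field(default_factory=time.time)

    receipt = ActionReceipt(
        scope="read repo, write tests only",
        artifacts=["tests/test_auth.py"],
        validations=["pytest -q passed"],
        evidence=["results.json"],
        limitations="no review of production config",
    )
    print(json.dumps(asdict(receipt), indent=2))
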
xuhu yesterday at 6:41 PM
It sounds like the "app written in C++ calling Lua scripts, versus app written in Lua calling C++ libraries" debate.

Both designs (Lightroom, game engines) have worked successfully.

There's probably nothing that prevents mixing both approaches in the same "app".

astrobiased yesterday at 5:26 PM
It's the right direction, but control flow introduces limitations within a system that is otherwise quite adaptable to dynamic situations. The more control flow you impose, the more buggy edge cases pop up if it's done poorly.

Still have yet to see a universal treatment that tackles this well.

niyikiza yesterday at 11:19 PM
My analogy[1] has been that we need a valet key: capped speed, geofenced, short TTL, can't open the trunk/glovebox, etc. That way we don't have to say pretty please to the valet and hope they won't get ideas.

[1] https://niyikiza.com/posts/capability-delegation/
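
Something like this, as a minimal sketch; the fields and guard are illustrative, not a real capability system:

    import time
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ValetKey:
        # The limits live in the key itself, not in a polite prompt.
        allowed_tools: frozenset  # no trunk, no glovebox
        max_calls: int            # capped "speed"
        expires_at: float         # short TTL

    def guard(key: ValetKey, tool: str, calls_so_far: int) -> None:
        # Deterministic checks the model cannot talk its way past.
        if time.time() > key.expires_at:
            raise PermissionError("key expired")
        if tool not in key.allowed_tools:
            raise PermissionError(f"tool {tool!r} was never delegated")
        if calls_so_far >= key.max_calls:
            raise PermissionError("call budget exhausted")

    key = ValetKey(allowed_tools=frozenset({"read_file", "run_tests"}),
                   max_calls=50, expires_at=time.time() + 900)  # 15-min lease
    guard(key, "read_file", calls_so_far=0)   # fine
    # guard(key, "open_trunk", 0)             # would raise PermissionError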

pron yesterday at 11:19 PM
How do you have "aggressive error detection" when some of the most common and pernicious mistakes agents make are architectural? The behaviour is fine, but the code is overly defensive, hiding possible bugs and invariant violations, leading to ever more layers of complexity, until nothing can be changed without breaking something.
arian_ yesterday at 6:30 PM
Control flow tells the agent what it's allowed to do. It doesn't tell you what the agent actually did. Both matter. Everyone is building the permission layer. Almost nobody is building the verification layer.
zby yesterday at 8:15 PM
I concur: it does not make sense to do in LLM prompts what can be done in code. Code is cheaper, faster, and deterministic, and we have lots of experience working with it.

Especially all bookkeeping logic should move into the symbolic layer: https://zby.github.io/commonplace/notes/scheduler-llm-separa...

onion2k yesterday at 6:06 PM
Agents are probabilistic systems. A common mechanism for getting a reliable answer from systems with variable output is to run them several times (ideally in separate, isolated instances) and then have something vote on the best result or take the most common one. This happens in things like rockets and aviation, where you have multiple systems giving an answer and an orchestrator picking the result.

I've tried doing something similar with AI by running a prompt several times and then have an agent pick the best response. It works fairly well but it burns a lot of tokens.
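
A minimal sketch of that voting setup, with model() standing in for one isolated run:

    from collections import Counter

    def model(prompt: str) -> str:
        # Stand-in for one independent, isolated model run.
        return "42"

    def vote(prompt: str, n: int = 5) -> str:
        # Run n independent samples and take the most common answer,
        # like redundant flight computers voting on a result.
        answers = [model(prompt) for _ in range(n)]
        answer, count = Counter(answers).most_common(1)[0]
        if count <= n // 2:
            raise RuntimeError("no majority; escalate to a human")
        return answer

    print(vote("what is 6 * 7?"))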

mnalley95 yesterday at 8:48 PM
Own your control flow! A key point from 12 factor agents.

"One thing that I have seen in the wild quite a bit is taking the agent pattern and sprinkling it into a broader more deterministic DAG." - https://github.com/humanlayer/12-factor-agents/blob/main/REA...

briga yesterday at 6:14 PM
Sometimes it feels like agents are just reinventing microservices, except they are doing it in the most inefficient way possible. It is certainly a good way for the LLM companies to sell more tokens.
allynjalford today at 3:37 AM
Totally agree. That's why I built it: https://backpac.xyz/cairn-cli
gardnr yesterday at 6:17 PM
This is straight outta 2023:

Agents aren't reliable; use workflows instead.

kmad yesterday at 7:17 PM
This is, at least in part, the promise of frameworks like DSPy and PydanticAI. They allow you to structure LLM calls within the broader control flow of the program, with typed inputs and outputs. That doesn’t fix non-determinism, hallucinations, etc., but it does allow you to decompose what it is you’re trying to accomplish and be very precise about when an LLM is called and why.
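
As a rough sketch of the general pattern, here with plain Pydantic v2 validation rather than either framework's actual API:

    from pydantic import BaseModel, ValidationError

    class Triage(BaseModel):
        severity: int  # typed output the surrounding program can branch on
        summary: str

    def call_llm(prompt: str) -> str:
        # Stand-in for a real completion call.
        return '{"severity": 2, "summary": "flaky test"}'

    def triage(ticket: str, retries: int = 2) -> Triage:
        prompt = f"Return JSON with integer severity and summary for: {ticket}"
        for _ in range(retries + 1):
            try:
                # Malformed output is caught and retried, not trusted.
                return Triage.model_validate_json(call_llm(prompt))
            except ValidationError:
                continue
        raise RuntimeError("model never produced valid JSON")

    print(triage("CI fails on main"))
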
rickysahu today at 4:52 AM
We work on this issue in healthcare (genhealth.ai), where it's imperative to get every step correct. Not easy. A valuable solution sits at the intersection of browser, code, and LLMs. There are far more layers of browser interaction than just images and the DOM.
sidcool today at 8:41 AM
How does one achieve this?
chandureddyvari yesterday at 6:46 PM
I had good success with hooks in Claude Code. Personally, I feel this problem was common with humans as well: we added tools like husky for git commits, so our peers push code that is linted, type-checked, etc.

I feel hooks are an integral part of your code harness; they're the only deterministic way to control coding agents.
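
A rough sketch of such a hook; the payload field names and the blocking exit code follow Claude Code's hook convention as I understand it, so check the docs before relying on them:

    import json, re, sys

    # Read the pending tool call from stdin and block it deterministically
    # if it matches a forbidden pattern. "tool_name"/"tool_input" and exit
    # code 2 as "block" are my understanding of the hook contract; verify.
    FORBIDDEN = re.compile(r"git\s+push\s+--force|rm\s+-rf\s+/")

    event = json.load(sys.stdin)
    command = event.get("tool_input", {}).get("command", "")

    if event.get("tool_name") == "Bash" and FORBIDDEN.search(command):
        print(f"blocked: {command!r}", file=sys.stderr)
        sys.exit(2)  # reject the tool call; stderr goes back to the model
    sys.exit(0)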

idivett yesterday at 11:26 PM
Isn't that what they call "Harness engineering"?
hmaxdml yesterday at 7:54 PM
We've found that durable workflows are a much-needed primitive for agent control flow. They give the structure for deterministic replays, observability, and, of course, fault tolerance that operators need to make the agent loop reliable.
SrslyJosh yesterday at 11:58 PM
> "Agents need control flow, not more prompts"

Can't wait for y'all to come full circle and invent programming from first principles.

piyh today at 2:08 AM
9 different frameworks being pushed in the comments of this thread. 2026 truly is the year of agents.
solomonb yesterday at 5:47 PM
I agree and I think a really wonderful way to encode agentic control flow would be with Polynomial Functors.

https://arxiv.org/abs/2312.00990

colek42 yesterday at 9:01 PM
We built https://aflock.ai/ (open source) to help with this. Constraining activity tends to work well
arbirk yesterday at 8:30 PM
I always wonder with these posts:

- Are they talking about coding (where I am the control flow)?

- Or RPA agents (where it is obvious)?

Also: don't use LLMs for deterministic tasks.
mohamedkoubaa yesterday at 10:21 PM
Eventually we'll all come to the inevitable conclusion that for a task to be fully automated there should be neither human nor genie in the loop.
glasner yesterday at 9:23 PM
This is exactly why I'm building aiki to be a control layer for harness execution. I don't think the model companies will ever give us the neutral layer we need.
graphememes yesterday at 10:55 PM
> If you’ve ever resorted to MANDATORY or DO NOT SKIP, you’ve hit the ceiling of prompting.

Using this is going to do the opposite of what you want.

jarboot yesterday at 8:44 PM
I think this is a good use case for Temporal + pydantic-ai.
cesarvarela yesterday at 8:18 PM
This will remain a persistent problem without a definitive answer until models move from generative tools to actual AI.
2001zhaozhao yesterday at 9:32 PM
If we need control flows, then designing these control flows ought to be the future of agent engineering
mhotchen yesterday at 9:29 PM
HUMANS need control flow. It's a very effective strategy that has worked wonders in healthcare
aykutseker yesterday at 6:34 PM
All caps in a prompt is a code smell. When you're typing MANDATORY, you should be writing a wrapper, not refining the prose.
ModernMech yesterday at 5:59 PM
Slowly and surely we are replacing AI with programming languages.
dnautics yesterday at 7:10 PM
Yes. Humans are also unreliable and nondeterministic (though certainly more reliable), and accordingly we have built software dev practices around this. I imagine it would be super useful, for example, to have a "TDD enforcer":

Phase 1: only test files may be altered, exactly one new test failure must appear.

Phase 2: only code files may be altered. The phase is cleared when the test now succeeds and no other tests fail.

If you get stuck, bail and ask for guidance.
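
A rough sketch of the enforcer, assuming git and pytest; it checks the weaker condition "some failure appears" in phase 1, since "exactly one new failure" would need a baseline to diff against:

    import subprocess

    def changed_files() -> list[str]:
        out = subprocess.run(["git", "diff", "--name-only"],
                             capture_output=True, text=True, check=True)
        return [f for f in out.stdout.splitlines() if f]

    def tests_pass() -> bool:
        return subprocess.run(["pytest", "-q"]).returncode == 0

    def check_phase(phase: int) -> None:
        files = changed_files()
        if phase == 1:
            assert files and all(f.startswith("tests/") for f in files), \
                "phase 1: only test files may be altered"
            assert not tests_pass(), "phase 1: a new test failure must appear"
        else:
            assert all(not f.startswith("tests/") for f in files), \
                "phase 2: only code files may be altered"
            assert tests_pass(), "phase 2: the tests must now succeed"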

ubj yesterday at 10:53 PM
I've said this before, but it's interesting to see momentum go back and forth between the flexibility and ease of everyday language, and the formal rigor of programming languages.

It feels like we are still discovering the optimal operating range on a spectrum between these two domains. Perhaps the optimal range will depend on the specific field in question.

geon yesterday at 7:30 PM
How is this not obvious to everyone? It's like people forgot how to engineer.
_pdp_ yesterday at 7:25 PM
Or maybe, just maybe, LLMs do not run deterministically, and that is OK?

In the real world almost nothing runs like that - only software, and even that has a lot of failures.

So perhaps, rather than trying to make agents run deterministically, the goal is to assume some failure rate and build compensating controls around it.

throwthrowuknow today at 1:06 AM
Isn’t this basically what Palantir does?
zekenie yesterday at 10:10 PM
You know, it really depends on what you're trying to accomplish and whether it's possible to describe it with deterministic control flow.
oinoom yesterday at 6:15 PM
this is just advocating for a harness, which has been the focus (along with evals) for at least the last three months by pretty much anyone working with agents professionally or seriously
eth415 yesterday at 5:15 PM
Agreed: this is what we've been trying to build at scale.

https://github.com/salesforce/agentscript

afxuh yesterday at 7:44 PM
That's why agents complete a project with the first 3 prompts, but then maintaining and fine-tuning it takes ages, until you hit "Session token expired".
try-working yesterday at 8:58 PM
That's why you need a recursive workflow that creates its own artifacts per step, which can later be used for verification.
terminalbraid yesterday at 8:49 PM
My friend, you have invented management.
marvinified today at 12:30 AM
Depends on the use case
ncrmro today at 1:40 AM
deepwork.md is made for this.
cookiengineer today at 12:15 AM
We have control flow. It's requirements specifications and test driven development. You just have to enforce it, so the agents cannot cheat their way around it.

I decided to build my agentic environment differently: local only, sandboxed, and enforced with Go-specific requirement definitions that the different agent roles cannot break, as a contract.

That alone is far better than any hyped markdown-storage-sold-as-memory project I've seen in the last few weeks.

Currently I am experimenting with skills tailored to other languages, because agent skills are actually kinda useless as-is: they're not enforced, nor can any of their metadata be used to predictably verify their behaviors.

My recommendation to others: treat LLM output as malware. Analyse its behavior, not its code. Never let LLMs work outside your sandbox, and make it impossible for them to escape it. That includes removing the Bash tool, for example, because that's not a reproducible sandbox.

Also, choose a language that comes with a strong unit-testing methodology. I chose Go because it allows me to write unit tests for my tools, and even for agent-to-agent communication down the line (with some limitations due to TestMain, but at least it's possible).

If you write your agent environment or harness in TypeScript, you have already failed before you started: compiled code isn't typesafe, because the compiler doesn't generate type checks in the resulting JS.

Anyways, my two cents from the purpleteaming perspective that tries to make LLMs as deterministic as possible.

carterschonwald today at 12:28 AM
I mean, of course. I've been working on this for the past few months and have a bunch of tech toward it in flight, including some harness forks to layer my ideas in, e.g. my oh punkin pi test bed on my github.com/cartazio page. There are some shockingly obvious (once you see them) tricks that I think I can stack into a really nice harness product for doing hard, real work with these models more easily.
droolingretard yesterday at 5:26 PM
Are you the guy who used to write MapleStory hacks?
ltbarcly3 yesterday at 8:16 PM
Don't listen to anyone who claims to know what should be done without proof. If someone 'knows' what agents 'need', then that knowledge is worth millions of dollars right now. If they haven't built it, they are probably just talking shit.
yogthos yesterday at 6:16 PM
This was basically my realization as well. We are trying to get LLMs to write software the way humans do, but they have a different set of strengths and weaknesses. Structuring tooling around what LLMs actually do well seems like an obvious thing to do. I wrote about this in some detail here:

https://yogthos.net/posts/2026-02-25-ai-at-scale.html

encoderer yesterday at 6:06 PM
You can get a lot done with agentic programming without going "all in" on a gastown-like system, but I think there is a minimum viable setup:

1. an adversarial agent harness that uses one agent to create a plan and implement it, and another to review the plan and code-review each step.

2. an agentic validation suite -- a more flexible take on e2e testing.

3. some custom skills that explain how to use both of those flows.

With this in place you can formulate ideas in a chat session, produce planning artifacts, then use the adversarial system to implement the plans and the validation layer to get everything working e2e for human review.

There are a lot of tools you can use for these things but I chose to just build the tooling in the repo as I go.
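
A minimal sketch of the adversarial loop at the heart of item 1; builder() and reviewer() are stand-ins for the two agents:

    def builder(plan: str) -> str:
        # Stand-in for the planning/implementing agent.
        return f"code for: {plan}"

    def reviewer(artifact: str) -> tuple[bool, str]:
        # Stand-in for the adversarial reviewing agent.
        return True, "LGTM"

    def adversarial_step(plan: str, max_rounds: int = 3) -> str:
        artifact = builder(plan)
        for _ in range(max_rounds):
            ok, feedback = reviewer(artifact)
            if ok:
                return artifact  # the review gate, not the builder, decides
            artifact = builder(f"{plan}\nreviewer said: {feedback}")
        raise RuntimeError("builder and reviewer never converged")

    print(adversarial_step("add pagination to /users"))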

ares623 today at 8:55 AM
Guys, c'mon, what are we even doing...
moron4hire yesterday at 11:12 PM
I've been building this at work. It's... shockingly not hard. People have been telling me, "get into agentic coding now or you'll get left behind", and the things they say need training and taste and expertise (figuring out how to cajole the AI into doing a job) are things that I can just write a program to do.

There's this guy at work who is kind of precious about Claude Code. When Hegseth banned Anthropic, this guy freaked out. He spent many pages ranting about how terrible Gemini and Codex are and basically nuked his project. He insisted only Claude could do his project.

Meanwhile, I managed to redo his work with GPT 4o in a weekend. No AI-generated code anywhere; just being capable of writing a for-loop over a directory of files myself. The AI part is only really necessary because folks can't be bothered to author documents with proper hierarchies.

People talk about "AI is going to eliminate boilerplate and accelerate development and we'll do new jobs that were too costly before". Yet this guy spent weeks coaxing Claude to do something that took me a few hours because "boilerplate" is really not that big of a deal. If this is the kind of job we're going to be able to do because the value-to-effort ratio was less than 1, it kind of indicates to me that there isn't a lot of value to gain at any level of effort. Yeah, it's not really worth your time to bend over and pick up a penny, but even if I had a magical penny snagging magnet, I'm still going to ignore the pennies because that's just how valueless pennies are.

If AI lets me never have to open a PowerPoint from a client to read the chart values from the piechart they screenshot and pasted into PowerPoint, that's wonderful. What more would I ever need? The rest of the work just isn't that hard. But if you think AI is going to replace people like me because it can do "boilerplate", the AI is not anywhere near as fast or cheap at getting to a reliable, consistent, repeatable process as a human for that.

empath75 yesterday at 7:48 PM
I have heard this sort of thing a lot from people working with agents, and I just think it's fundamentally misguided as a way to think of them. If you work with them this way, you are probably setting money on fire for no reason, because the tasks you are able to perform this way _don't need agents to begin with_.

You might use an LLM api call here as a translation or summary step in a deterministic workflow, but they are not acting as agents, because they lack _agency_.

The value of using an agent harness is precisely that they are _not deterministic_. You provide agents a goal, tools and constraints and they do the task they were asked to perform as best as they can figure out how to do it. You may provide them deterministic workflows as tools they can call, but those workflows, outside of the agent harness itself, should not constrain what the agent does. You are paying a lot of money for agent reasoning, not to act as an expensive data transformation pipeline.

It may be the case that a lot of agentic workflows are more properly done with fully deterministic workflows, but the goal there should be to _remove the agents entirely_ and spend those tokens on non deterministic tasks that require agentic decision making.

I do think there are fundamental limits to what agents are capable of doing unsupervised, and there does need to be a lot more human guidance, observability, and control over what they are doing, but that's sort of the opposite of embedding them in deterministic workflows; that is more of a team integration/communication problem to solve.

AIorNot yesterday at 5:10 PM
I mean, we have LangGraph, BAML, etc.
MagicMoonlight today at 3:00 AM
This is slop-generated, right?
taherchhabra yesterday at 6:00 PM
I wrote something recently on how agent development differs from traditional software development:

https://x.com/i/status/2051706304859881495