AI agents break rules under everyday pressure

200 points - last Thursday at 10:52 AM

Source

Comments

hxtk today at 5:44 AM
Blameless postmortem culture recognizes human error as an inevitability and asks those with influence to design systems that maintain safety in the face of human error. In the software engineering world, this typically means automation, because while automation can and usually does have faults, it doesn't suffer from human error.

Now we've invented automation that commits human-like error at scale.

I wouldn't call myself anti-AI, but it does seem fairly obvious to me that directly automating things with AI will probably always have substantial risk and you have much more assurance, if you involve AI in the process, using it to develop a traditional automation. As a low-stakes personal example, instead of using AI to generate boilerplate code, I'll often try to use AI to generate a traditional code generator to convert whatever DSL specification into the chosen development language source code, rather than asking AI to generate the development language source code directly from the DSL.

hosh today at 2:39 PM
Humans do the same thing.

I have a friend who is a systems engineer, working at a construction company building datacenters. She tells me that someone has to absorb the risk and uncertainty in the global supply chain, and if there are contractual obligation to guarantee delivery, then the tendency is to start straying into unethical behavior, or practices that violate controls and policies.

This has as much to do with taking the slack out of the system. Something, somewhere is going to break. If you tell an agenic AI that it must complete the task by a deadline, and it cannot find a way to do that within ethical parameters, it starts searching beyond the bounds of ethical behavior. If you tell the AI it can push back, warn about slipping deadlines, that it is not worth taking ethical shortcuts to meet a deadline, then maybe it won’t.

kingstnap today at 5:55 AM
I watched Dex Horthys recent talk on YouTube [0] and something he said that might be partly a joke partly true is this.

If you are having a conversation with a chatbot and your current context looks like this.

You: Prompt

AI: Makes mistake

You: Scold mistake

AI: Makes mistake

You: Scold mistake

Then the next most likely continuation from in context learning is for the AI to make another mistake so you can Scold again ;)

I feel like this kind of shenanigans is at play with this stuffing the context with roleplay.

[0] https://youtu.be/rmvDxxNubIg?si=dBYQYdHZVTGP6Rvh

bethekidyouwant today at 2:44 PM
There’s no such thing as rules in this case, at best they are suggestions.
N_Lens today at 2:11 PM
It’s not just slopification, it’s mass slopification with fractal agents out of every orifice!!
ramoz today at 12:53 PM
Rules need empowerment.

Excited to be releasing cupcake at the end of this week. For deterministic and non-deterministic guardrailing. It integrates via hooks (we created the feature request to anthropic for Claude code).

https://github.com/eqtylab/cupcake

zone411 today at 7:47 AM
Without monitoring, you can definitely end up with rule-breaking behavior.

I ran this experiment: https://github.com/lechmazur/emergent_collusion/. An agent running like this would break the law.

"In a simulated bidding environment, with no prompt or instruction to collude, models from every major developer repeatedly used an optional chat channel to form cartels, set price floors, and steer market outcomes for profit."

ineedasername today at 2:03 PM
Why on earth would you deliberately place under pressure with a prompt that indicates this to begin with? First, an operationalized prompt ought not in the first place be under modifiable control any more that a typical workflow would be. In this case, the prompt is merely part of an overall workflow. Second, what would someone, even if they did have access, think was being accomplished by prompting, "time is short, I have a deadline, let's make this quick"? It betrays a complete misunderstanding of how this tech works.
lloydjones today at 7:17 AM
I tried to think about how we might (in the EU) start to think about this problem within the law, if of interest to anyone: https://www.europeanlawblog.eu/pub/dq249o3c/release/1
ai_updates today at 9:28 AM
Great points. In my experiments combining AI with spaced repetition and small deliberate-practice tasks, I saw retention improve dramatically — not just speed. I think the real win is designing short active tasks around AI output (quiz, explain-back, micro-project). Has anyone tried formalizing this into a daily routine?
Taniwha today at 12:06 PM
Guess what, if you AI agent does insider trading on your behalf you're still going to jail
jakozaur today at 9:03 AM
Is it just me, or do LLM code assistants do catastrophically silly things (drop a DB, delete files, wipe a disk, etc.) far more often than humans?

It looks like the training data has plenty of those examples, but the models don’t have enough grounding or warnings before doing them. I wish there were a PleaseDontDoAnythingStupidEval for software engineering.

weatherlite today at 9:56 AM
> AI agents break rules under everyday pressure

Jeez they really ARE becoming human like

joe_the_user today at 6:16 AM
Sure,

LLMs are trained on human behavior as exhibited on the Internet. Humans break rules more often under pressure and sometimes just under normal circumstances. Why wouldn't "AI agents" behave similarly?

The one thing I'd say is that humans have some idea which rules in particular to break while "agents" seem to act more randomly.

salkahfi last Friday at 8:37 PM
crooked-v today at 5:22 AM
I wonder who could have possibly predicted this being a result of using scraped web forums and Reddit posts for your training material.
sammy2255 today at 5:58 AM
..because it's in their training data? Case closed
baxuz today at 1:09 PM
What a bullshit article.

AI agents don't think, don't have a concept of time, and don't experience pressure.

I'm tired of these articles anthropomorphizing a probability engine.

dlenski today at 6:25 AM
ā€œAI agents: They're just like usā€
js8 today at 7:03 AM
CMIIW currently AI models operate in two distinct modes:

1. Open mode during learning, where they take everything that comes from the data as 100% truth. The model freely adapts and generalizes with no constraints on consistency.

2. Closed mode during inference, where they take everything that comes from the model as 100% truth. The model doesn't adapt and behaves consistently even if in contradiction with the new information.

I suspect we need to run the model in the mix of the two modes, and possibly some kind of "meta attention" (epistemological) on which parts of the input the model should be "open" (learn from it) and which parts of the input should be "closed" (stick to it).