LLMs corrupt your documents when you delegate
418 points - yesterday at 8:44 AM
It's unsurprising that round-tripping long content through an LLM results in corruption. Frequent LLM users already know not to do that.
They claim that tool use didn't help, which surprised me... but they also said:
> To test this, we implemented a basic agentic harness (Yao et al., 2022) with file reading, writing, and code execution tools (Appendix M). We note this is not an optimized state-of-the-art agent system; future work could explore more sophisticated harnesses.
And yeah, their basic harness consists of read_file() and write_file() - that's just round-tripping with an extra step!
The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite described here: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
The str_replace and insert commands are essential for avoiding round-trip risky edits of the whole file.
They do at least provide a run_python() tool, so it's possible the better models figured out how to run string replacement using that. I'd like to see their system prompt and whether it encouraged Python-based manipulation over reading and then re-writing the whole file.
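Either way it comes down to the same targeted operation. A minimal sketch of a str_replace-style edit (my own illustration, not Anthropic's actual implementation):

    # Sketch of a str_replace-style edit: swap one unique snippet in place
    # instead of asking the model to re-emit the entire document.
    from pathlib import Path

    def str_replace(path: str, old: str, new: str) -> None:
        text = Path(path).read_text()
        count = text.count(old)
        if count != 1:
            # Refuse ambiguous or missing matches - this is what keeps the
            # edit surgical rather than a lossy round trip.
            raise ValueError(f"expected exactly one match for {old!r}, found {count}")
        Path(path).write_text(text.replace(old, new, 1))

The untouched remainder of the file never passes through the model at all, so it can't be corrupted.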
Update: found that harness code here https://github.com/microsoft/delegate52/blob/main/model_agen...
The relevant prompt fragment is:
> You can approach the task in whatever way you find most effective: programmatically or directly by writing files
As with so many papers like this, the results reflect more on the design of the harness the paper's authors used than on the models themselves. I'm confident an experienced AI engineer / prompt engineer / pick your preferred title could get better results on this test by iterating on the harness itself.
"Semantic ablation" is my favorite term for it: https://www.theregister.com/software/2026/02/16/semantic-abl...
They are essentially like that one JPEG meme, where each pass of saving as JPEG slightly degrades the quality until by the end it's unrecognizable.
Except with LLMs, the starting point is intent. Each pass through the LLM degrades the intent: in the case of a precise scientific paper, a little bit of nuance, a little bit of precision is lost with a re-wording here and there.
LLMs are mean reversion machines: the more 'outside of their training' the context/workload they are currently dealing with, the more they will tend to gradually pull it toward some homogeneous abstract equilibrium.
What has worked well in practice is giving the agent a directory and telling it to make independent markdown files for the facts/findings it locates, with each file having front-matter for easy searchability.
This de-complects most tasks from "research AND store iteratively in a final document format" into two more cohesive tasks: "research a set of facts and findings which may be helpful for a document" and "assemble the document".
Only a partial mitigation, but I find it leads to more versatile re-use of findings, same as if a human were working.
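A rough sketch of the layout (field names like topic and source are just illustrative, not a standard):

    # One small markdown file per finding, with front-matter the agent
    # can grep/filter on later.
    from pathlib import Path

    FINDINGS = Path("findings")
    FINDINGS.mkdir(exist_ok=True)

    def save_finding(slug: str, topic: str, source: str, body: str) -> None:
        front_matter = f"---\ntopic: {topic}\nsource: {source}\n---\n\n"
        (FINDINGS / f"{slug}.md").write_text(front_matter + body)

    def findings_on(topic: str) -> list[Path]:
        # Cheap front-matter "search"; a real setup might parse YAML properly.
        return [p for p in FINDINGS.glob("*.md") if f"topic: {topic}" in p.read_text()]

    save_finding("roundtrip-risk", "llm-editing", "paper", "Full-file rewrites degrade intent.")
    print(findings_on("llm-editing"))

Assembling the final document then becomes a separate pass that only reads these small files.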
It would be interesting to know if the stronger results on Python are not just an artefact of the Python-specific evaluation, if they carry over to other common general-purpose languages, and if they are driven by something specific in the training processes.
That's why harnesses and prompting rituals using dozens of markdown files do not work as advertised and are pretty much snake oil, otherwise known as "agentic engineering".
Also, "agentic engineering" is pretty much so-called prompt engineering, except that the prompt is now spread across dozens of markdown files and directories.
This works well for code regressions but also works for document writing. I've automated it at this point.
A case where using the CLI agent is much better than using the web chat.
> We find that weaker models’ degradation originates primarily from content deletion, while frontier models’ degradation is attributable to corruption of content.
I think we largely already knew this. This is why we fudge around with harnesses and temperature etc.
It's like how psychopaths are eerie because there's nothing behind their eyes. AI-generated code is eerie because there's nothing between the lines. Code is in some sense theory building, and when you read a human's code you can (mostly) feel their theory working in the background. LLMs have no such theory; the code is just facts strewn about. Very weird experience to try and understand it.
We live and learn.
Still a huge fan though.
That is, the LLM should produce a diff, and the user should accept the diff. It seems like a bad pattern to just tell the LLM to edit any long document without that sort of visibility. Same goes for prose as for code.
The same is not so easy with free form text. I have been thinking about this mainly around when agents write plans or edit plans, but I think figuring out how to do this in general would be a huge breakthrough.
Logical English was one idea I came across and Runcible https://runcible.com/ was another idea I recently stumbled on.
Then I can diff what they wrote with my copy.
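Nothing fancier than difflib is needed for that loop; a minimal sketch (file names are just placeholders):

    # Keep your own copy and review the model's rewrite as a diff instead
    # of trusting a full regeneration.
    import difflib

    original = open("report.md").read().splitlines(keepends=True)
    rewritten = open("report_llm.md").read().splitlines(keepends=True)

    diff = difflib.unified_diff(original, rewritten,
                                fromfile="report.md", tofile="report_llm.md")
    print("".join(diff))  # empty output means nothing silently changed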
Users are the OG container. On Linux it's possible to constrain a user to a network namespace and cgroups.
BPF can be used, much like docker compose, to ensure a service running under a user stays running.
TL;DR a lot of the userspace cruft we import to run software has been rolled into the kernel over the last 10-15 years.
Ignore the terminology "user". Under the hood, all the same constraint and boundary setting you want exists without downloading the entire internet.
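For what it's worth, the network-namespace piece is a few lines from Python these days; a sketch, assuming Linux, root, and Python 3.12+ (where os.unshare landed):

    # Drop this process into a fresh, empty network namespace: no
    # interfaces, no route to the outside world.
    import os, socket

    os.unshare(os.CLONE_NEWNET)

    try:
        socket.create_connection(("example.com", 80), timeout=2)
    except OSError as err:
        print("no network reachable from this namespace:", err)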
I have yet to find a model that does not make mistakes each turn. I suspect that this kind of error is fundamentally incorrigible.
The most interesting thing about LLMs is that despite the above (and its non-determinism) they're still useful.
on the flip side if you’re literally just using a bare bones harness on top of a stochastic parrot, of course stochastic errors accumulate.
there are a lot of ways to improve text faithfulness through harness tool design, and my incremental experiments seem promising.
but unless the work is gated on shit like "the script used must type-check as GHC Haskell or Lean 4", unsupervised stuff is gonna decay
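a crude sketch of that gate, assuming ghc is on PATH (-fno-code type-checks without generating code; a Lean 4 gate would shell out to lean instead):

    # Only keep the agent's edit if the compiler's type checker accepts it.
    import subprocess
    from pathlib import Path

    def accept_edit(path: str, new_source: str) -> bool:
        target = Path(path)
        backup = target.read_text()
        target.write_text(new_source)
        check = subprocess.run(["ghc", "-fno-code", str(target)],
                               capture_output=True, text=True)
        if check.returncode != 0:
            target.write_text(backup)  # reject the edit, restore the old file
            print(check.stderr)
            return False
        return True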
I've also had them convert something like an Excel-formatted document to markdown. It worked pretty well as long as I was examining the output. But the longer it ran in context, the more likely it was to slip in things that seemed related but weren't part of the breakdown.
The only way I've found to mitigate some of it is to make every file a small, purpose-built doc. That way you can use git to revert changes, and it also limits the damage to a small context every time they touch a file.
Anyone who thinks they're a genius creating docs or updating them isn't actually reading the output.
The way this experiment is conducted is not in line with how current agentic AI is used OR how even humans edit documents.
Here's how agentic AI typically does edits today:
1. They read the whole document.
2. They come up with a patch: a diff of the section they want to edit.
3. They change THAT section only.
This is NOT what that experiment was doing. A 25% degradation rate would render the whole industry dead. No one would be using claude code because of that. The reality is... everyone is using claude code.
AI is alien to the human brain, but in many ways it is remarkably similar. This is one aspect of similarity: we cannot edit a whole document holistically to produce one edit either. It has to be targeted surgical edits rather than a regurgitation of the entire document with said edit.