LLMs corrupt your documents when you delegate
418 points - yesterday at 8:44 AM
It's unsurprising that round-tripping long content through an LLM results in corruption. Frequent LLM users already know not to do that.
They claim that tool use didn't help, which surprised me... but they also said:
> To test this, we implemented a basic agentic harness (Yao et al., 2022) with file reading, writing, and code execution tools (Appendix M). We note this is not an optimized state-of-the-art agent system; future work could explore more sophisticated harnesses.
And yeah, their basic harness consists of read_file() and write_file() - that's just round-tripping with an extra step!
The modern coding agent harnesses put a LOT of work into the design of their tools for editing files. My favorite current example of that is the Claude edit suite described here: https://platform.claude.com/docs/en/agents-and-tools/tool-us...
The str_replace and insert commands are essential for avoiding round-trip risky edits of the whole file.
They do at least provide a run_python() tool, so it's possible the better models figured out how to run string replacement using that. I'd like to see their system prompt and whether it encouraged Python-based manipulation over reading and then re-writing the whole file.
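Either way it comes down to the same targeted operation. A minimal sketch of a str_replace-style edit (my own illustration, not Anthropic's actual implementation):

    # Sketch of a str_replace-style edit: swap one unique snippet in place
    # instead of asking the model to re-emit the entire document.
    from pathlib import Path

    def str_replace(path: str, old: str, new: str) -> None:
        text = Path(path).read_text()
        count = text.count(old)
        if count != 1:
            # Refuse ambiguous or missing matches - this is what keeps the
            # edit surgical rather than a lossy round trip.
            raise ValueError(f"expected exactly one match for {old!r}, found {count}")
        Path(path).write_text(text.replace(old, new, 1))

The untouched remainder of the file never passes through the model at all, so it can't be corrupted.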
Update: found that harness code here https://github.com/microsoft/delegate52/blob/main/model_agen...
The relevant prompt fragment is:
> You can approach the task in whatever way you find most effective: programmatically or directly by writing files
As with so many papers like this, the results reflect more on the design of the harness the paper's authors used than on the models themselves. I'm confident an experienced AI engineer / prompt engineer / pick your preferred title could get better results on this test by iterating on the harness itself.
"Semantic ablation" is my favorite term for it: https://www.theregister.com/software/2026/02/16/semantic-abl...
They are essentially like that one JPEG meme, where each pass of saving as JPEG slightly degrades the quality until by the end it's unrecognizable.
Except with LLMs, the starting point is intent. Each pass through the LLM degrades the intent: in the case of a precise scientific paper, a little bit of nuance, a little bit of precision is lost with a re-wording here and there.
LLMs are mean reversion machines: the more 'outside of their training' the context/workload they are currently dealing with, the more they will tend to gradually pull it toward some homogeneous abstract equilibrium.
What has worked well in practice is giving the agent a directory and telling it to make independent markdown files for the facts/findings it locates, with each file having front-matter for easy searchability.
This de-complects most tasks from "research AND store iteratively in a final document format" into two more cohesive tasks: "research a set of facts and findings which may be helpful for a document" and "assemble the document".
Only a partial mitigation, but I find it leads to more versatile re-use of findings, same as if a human were working.
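A rough sketch of the layout (field names like topic and source are just illustrative, not a standard):

    # One small markdown file per finding, with front-matter the agent
    # can grep/filter on later.
    from pathlib import Path

    FINDINGS = Path("findings")
    FINDINGS.mkdir(exist_ok=True)

    def save_finding(slug: str, topic: str, source: str, body: str) -> None:
        front_matter = f"---\ntopic: {topic}\nsource: {source}\n---\n\n"
        (FINDINGS / f"{slug}.md").write_text(front_matter + body)

    def findings_on(topic: str) -> list[Path]:
        # Cheap front-matter "search"; a real setup might parse YAML properly.
        return [p for p in FINDINGS.glob("*.md") if f"topic: {topic}" in p.read_text()]

    save_finding("roundtrip-risk", "llm-editing", "paper", "Full-file rewrites degrade intent.")
    print(findings_on("llm-editing"))

Assembling the final document then becomes a separate pass that only reads these small files.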
It would be interesting to know if the stronger results on Python are not just an artefact of the Python-specific evaluation, if they carry over to other common general-purpose languages, and if they are driven by something specific in the training processes.
That's why harnesses and prompting rituals using dozens of markdown files do not work as advertised and are pretty much snake oil, otherwise known as "agentic engineering".
Also, "agentic engineering" is pretty much so-called prompt engineering, except that the prompt is now spread across dozens of markdown files and directories.
This works well for code regressions but also works for document writing. I've automated it at this point.
A case where using the CLI agent is much better than using the web chat.
> We find that weaker models’ degradation originates primarily from content deletion, while frontier models’ degradation is attributable to corruption of content.
I think we largely already knew this. This is why we fudge around with harnesses and temperature etc.
It's like how psychopaths are eerie because there's nothing behind their eyes. AI-generated code is eerie because there's nothing between the lines. Code is in some sense theory building, and when you read a human's code you can (mostly) feel their theory working in the background. LLMs have no such theory; the code is just facts strewn about. Very weird experience to try and understand it.
We live and learn.
Still a huge fan though.
That is, the LLM should produce a diff, and the user should accept the diff. It seems like a bad pattern to just tell the LLM to edit any long document without that sort of visibility. Same goes for prose as for code.
The same is not so easy with free form text. I have been thinking about this mainly around when agents write plans or edit plans, but I think figuring out how to do this in general would be a huge breakthrough.
Logical English was one idea I came across and Runcible https://runcible.com/ was another idea I recently stumbled on.
Then I can diff what they wrote with my copy.
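Nothing fancier than difflib is needed for that loop; a minimal sketch (file names are just placeholders):

    # Keep your own copy and review the model's rewrite as a diff instead
    # of trusting a full regeneration.
    import difflib

    original = open("report.md").read().splitlines(keepends=True)
    rewritten = open("report_llm.md").read().splitlines(keepends=True)

    diff = difflib.unified_diff(original, rewritten,
                                fromfile="report.md", tofile="report_llm.md")
    print("".join(diff))  # empty output means nothing silently changed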
Users are the OG container. On Linux it's possible to constrain a user to a network namespace and cgroups.
BPF can be used, much like docker compose, to ensure a service running under a user stays running.
TL;DR a lot of the userspace cruft we import to run software has been rolled into the kernel over the last 10-15 years.
Ignore the terminology "user". Under the hood, all the same constraint and boundary setting you want exists without downloading the entire internet.
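For what it's worth, the network-namespace piece is a few lines from Python these days; a sketch, assuming Linux, root, and Python 3.12+ (where os.unshare landed):

    # Drop this process into a fresh, empty network namespace: no
    # interfaces, no route to the outside world.
    import os, socket

    os.unshare(os.CLONE_NEWNET)

    try:
        socket.create_connection(("example.com", 80), timeout=2)
    except OSError as err:
        print("no network reachable from this namespace:", err)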
I have yet to find a model that does not make mistakes each turn. I suspect that this kind of error is fundamentally incorrigible.
The most interesting thing about LLMs is that despite the above (and its non-determinism) they're still useful.
on the flip side if you’re literally just using a bare bones harness on top of a stochastic parrot, of course stochastic errors accumulate.
there are a lot of ways to improve text faithfulness through harness tool design, and my incremental experiments seem promising.
but unless the work is gated on shit like "the script used must type-check as GHC Haskell or Lean 4", unsupervised stuff is gonna decay
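a crude sketch of that gate, assuming ghc is on PATH (-fno-code type-checks without generating code; a Lean 4 gate would shell out to lean instead):

    # Only keep the agent's edit if the compiler's type checker accepts it.
    import subprocess
    from pathlib import Path

    def accept_edit(path: str, new_source: str) -> bool:
        target = Path(path)
        backup = target.read_text()
        target.write_text(new_source)
        check = subprocess.run(["ghc", "-fno-code", str(target)],
                               capture_output=True, text=True)
        if check.returncode != 0:
            target.write_text(backup)  # reject the edit, restore the old file
            print(check.stderr)
            return False
        return True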
I've also had them convert something like an Excel-formatted document to markdown. It worked pretty well as long as I was examining the output. But the longer it ran in context, the more likely it was to slip in things that seemed related but weren't part of the breakdown.
The only way I've found to mitigate some of it is to make every file a small, purpose-built doc. That way you can use git to revert changes, and it also limits the damage to a small context every time they touch a file.
Anyone who thinks they're a genius creating docs or updating them isn't actually reading the output.
The way this experiment is conducted is not in line with how current agentic AI is used OR how even humans edit documents.
Here's how agentic AI typically does edits today:
1. They read the whole document.
2. They come up with a patch: a diff of the section they want to edit.
3. They change THAT section only.
This is NOT what that experiment was doing. A 25% degradation rate would render the whole industry dead. No one would be using claude code because of that. The reality is... everyone is using claude code.
AI is alien to the human brain, but in many ways it is remarkably similar. This is one aspect of similarity: we cannot edit a whole document holistically to produce one edit either. It has to be targeted surgical edits rather than a regurgitation of the entire document with said edit.