ProgramBench: Can Language Models Rebuild Programs from Scratch?

65 points - today at 3:46 AM


adrian_b today at 10:38 AM
> Open internet with cheating detection => cheating is widespread, 20-36% of tasks are flagged for the stronger models, with source code lookup accounting for the majority of the violations.

Therefore:

> blocking internet access entirely is the appropriate default for ProgramBench

The fact that your Anthropic coding assistant has a tendency to search the Internet for code to insert into your program may count as an additional copyright violation (besides the possibility of reproducing recognizable fragments of its training data).

(I do not agree that copyright, at least in its current form, should apply to computer programs. But it is odd that the same companies who try to exploit copyright against others also insist on coding assistants that work around copyright law. That workaround is a main reason those assistants can increase programming productivity: they can cut and paste code that you are not allowed to copy yourself.)

weinzierl today at 10:11 AM
"Models favor monolithic, single-file implementations that diverge sharply from human-written code."

You don't say! I might have been an LLM all along without even knowing it, since I too prefer single-file implementations.

Back in the old VB5/VB6 days, Visual Studio had a mode that showed the different functions in a file almost as if they were separate files. You could not scroll past a function's end, but you could easily switch between that mode and the full-file view. I always found that a nice way of working (though admittedly the world was a lot simpler back then).

Also, my preference for fewer but longer files only holds when I write the code myself. When working with AI, I think smaller files are beneficial for quicker turnaround between human and machine.

tadamcz today at 9:34 AM
Nice work once again from Ofir Press and team; this seems to be an idea that's in the air.

> Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task

Fwiw, this is very different from what we find in MirrorCode:

> Opus 4.6 successfully reimplements almost every program up to gotree’s size in our benchmark.

https://epoch.ai/blog/mirrorcode-preliminary-results

I don't have time right now to dig into what could explain the difference (I'm working hard on getting the full MirrorCode out as soon as possible). But I suspect that the ProgramBench authors are either under-eliciting the AIs, or their tasks are unfair/impossible given the constraints, or both.

I hope to look more into it after releasing MirrorCode, and write up my conclusions.

_pdp_ today at 6:01 AM
I am not surprised but this one sticks out...

> Models favor monolithic, single-file implementations that diverge sharply from human-written code.

Well, all of our code is monolithic, with some files close to 20K lines of code, and we do use coding agents, not for the original code but increasingly as of late. I've always had the hunch that splitting everything into tiny files does not improve AI coding agent performance, although that feels counterintuitive given model context constraints.

To me, the important parts of a program should be clustered together so the implementation is obvious. Scattering the implementation across various files all over the source tree does not help much in building the mental model.

That also closely matches how software used to be written in the past.

miguel_martin today at 6:34 AM
It’s unfortunate that they didn’t evaluate using subagents/orchestration for such a complex set of tasks (from what I can tell), e.g. analyze the program to produce an initial spec -> code -> review, and rinse and repeat, with each of those steps allocated to a separate subagent.
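Roughly what I have in mind, as a minimal sketch (run_subagent is a made-up placeholder for whatever agent framework you use; this is not what the paper evaluated):

    def run_subagent(role: str, prompt: str) -> str:
        # Hypothetical helper: hand the prompt to a fresh model instance
        # acting in `role` and return its text reply. Swap in your own
        # agent framework or API call here.
        raise NotImplementedError

    def rebuild(target_description: str, max_rounds: int = 3) -> str:
        # spec -> code -> review loop, each step its own subagent
        spec = run_subagent("analyst", f"Write a behavioral spec for: {target_description}")
        code = run_subagent("coder", f"Implement this spec from scratch:\n{spec}")
        for _ in range(max_rounds):
            review = run_subagent("reviewer", f"Spec:\n{spec}\n\nCode:\n{code}\n\nList defects, or reply OK.")
            if review.strip().upper() == "OK":
                break
            code = run_subagent("coder", f"Fix these issues:\n{review}\n\nCurrent code:\n{code}")
        return code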

I would be interested to see if there’s a significant quantifiable difference.

andy12_ today at 8:47 AM
It's interesting that Figure 4 shows Sonnet and Opus with a curve clearly distinct from all other models, even from GPT 5.4. Anthropic superiority, I guess.
vatsachak today at 5:42 AM
In before "but they did not use my agent swarm"
behaviors today at 9:19 AM
It's funny, because that task is very diverse. Any LLM will use the given codebase as a template (at least the free-tier models will).

My software-as-a-contract-of-behaviors setup works like a program bench (I even cross-tested buildouts). I made an entire corpus layout so that multi-agent, multi-platform builds can be compared, and went ahead and ran 50 contracts as an example. It honestly showed improvable areas and distinct differences between the models' code.

    {contract_name}/
    └── submissions/
        └── {date}_{os}_{agent}_{model}_{stack}/
            ├── {contract}.osc.md
            ├── osc.osc.md
            └── results/
                └── {contract}.snapshot.json

That's it: compare against the same contract, or find a new contract to compare with. Lots of signed/hash-pinned files are all you need to reproduce software from nothing with an LLM.
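Comparing two submissions then mostly comes down to diffing their snapshots. A minimal, schema-agnostic sketch (the actual snapshot format is whatever the contract pins, so this just diffs top-level fields):

    import json, sys

    # Usage: python diff_snapshots.py a.snapshot.json b.snapshot.json
    a, b = (json.load(open(p)) for p in sys.argv[1:3])
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            print(f"{key}: {a.get(key)!r} != {b.get(key)!r}")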

ProgramBench is close to that (they have a nice paper/article here), but I don't like the wording used. Having software to start with is not a bench of making code but of reverse engineering.

github/s1ugh34d/osc

luca-ctx today at 6:17 AM
RE: monolithic, single-file implementations

We have a lint that caps source code files at 650 LOC and it works really well.
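Something in this spirit works as the check itself (a minimal sketch, not our actual lint; the *.py glob and exit codes are just illustrative):

    import pathlib, sys

    MAX_LOC = 650  # cap per source file; pick your own
    offenders = []
    for path in pathlib.Path(".").rglob("*.py"):  # adjust the glob to your codebase
        loc = sum(1 for _ in path.open(errors="ignore"))
        if loc > MAX_LOC:
            offenders.append((path, loc))
    for path, loc in offenders:
        print(f"{path}: {loc} lines (limit {MAX_LOC})")
    sys.exit(1 if offenders else 0)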

keyle today at 5:55 AM
How long until AI is not even writing code but producing machine code?

Think about it, all these compilers, tooling, what a waste!

I imagine a future where chipset makers will provide a model you can just prompt to "act upon that chipset" and voila, "You're absolutely right! Here is your binary."

We won't be developers, we won't be devops, we'll be rollmops! /s