Computer Use is 45x more expensive than structured APIs
410 points - yesterday at 4:34 PM
Hang on, that sounds like common corporate SaaS apps.
Well, if your backend was sufficiently decoupled from your frontend, and the server-side operations were designed thoughtfully and generically, it need not be an engineering project.
The landing page doesn't advertise it yet, but essentially, I give agents a small set of tools to explore apps' surfaces, and then an API over common macOS functions, especially those related to accessibility.
The agent explores the app, then writes a repeatable workflow for it. Then it can run that workflow through CLI: `invoke chrome pinTab`
Why accessibility? Well, it turns out to be just a good DOM in general: it's structure for apps. Not all apps implement it perfectly, but enough do to make it wildly useful.
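For the curious, here's a rough sketch of what walking that structure can look like, assuming the pyobjc bindings for the macOS AX* C API (the wrapper calling convention, pid, and depth limit are illustrative, and this is not the tool's actual code):

```python
# Sketch: walk an app's accessibility tree via the macOS AX API.
# Assumes the pyobjc ApplicationServices bindings; pid and depth limit are arbitrary.
from ApplicationServices import (
    AXUIElementCreateApplication,
    AXUIElementCopyAttributeValue,
)

def dump_ax_tree(element, depth=0, max_depth=4):
    """Print role/title for each node -- much like walking a DOM."""
    _, role = AXUIElementCopyAttributeValue(element, "AXRole", None)
    _, title = AXUIElementCopyAttributeValue(element, "AXTitle", None)
    print("  " * depth + f"{role or '?'} {title or ''}".rstrip())
    if depth >= max_depth:
        return
    _, children = AXUIElementCopyAttributeValue(element, "AXChildren", None)
    for child in (children or []):
        dump_ax_tree(child, depth + 1, max_depth)

dump_ax_tree(AXUIElementCreateApplication(12345))  # 12345 = pid of the app to inspect (hypothetical)
```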
[1] https://getinvoke.com - note that the landing page is targeted towards creatives right now and doesn't talk about this use case yet
_of course_ computer use is worse. It is your last resort. Do not use it on state that lives in a DB that you own.
If anything I am impressed that it's only 50x worse.
If one agent just explores the UI, maybe in a test environment, and outputs a somewhat-structured description of the various UI elements and their behavior, and then another agent is given that description, would that second agent perform better than an agent that both explores the UI and tries to accomplish the given task at the same time?
With an example UI I made up, the description (API-like interface definition) could be something like:
Get all reviews:
To get all the reviews you need to go to each page and click "show full review" for every review summary in that page.
Go to each page:
Start at page 1 (the default when in the Reviews tab). Continue by clicking the "next" button until the "next" button is no longer available (as you've reached the last page).
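For concreteness, here's roughly what acting on that description boils down to, sketched with Playwright's sync Python API (the URL and selectors are invented to match the made-up UI above):

```python
# Sketch: the "get all reviews" navigation skill derived from the description above.
# The URL and selectors are hypothetical; only the pagination pattern matters.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/product/reviews")  # lands on page 1, the Reviews tab default

    reviews = []
    while True:
        # Expand every review summary on the current page.
        for btn in page.locator("text=show full review").all():
            btn.click()
        reviews += page.locator(".review-body").all_inner_texts()

        # "Go to each page": click "next" until it is no longer available.
        next_btn = page.locator("button:has-text('next')")
        if next_btn.count() == 0 or next_btn.is_disabled():
            break
        next_btn.click()

    print(f"collected {len(reviews)} reviews")
    browser.close()
```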
So the second agent can skip some thinking about how to navigate because it already has that skill. The first agent can explore the UI on its own, once, without worrying about messing up if there's a test environment. Or am I misunderstanding the article completely? Probably. But it's interesting nonetheless. Sorry if it makes no sense.
My core idea was that the "fast" perception loop is fully local and GPU-optimised for UI tokenisation and change detection, while the "slow" control loop requires an LLM round trip and uses a token-efficient markdown interface in the CLI output.
It uses relatively stable identifiers for controls, so agents can script common actions; e.g. `desktopctl pointer click --id btn_save` doesn't require the UI-tokenisation loop.
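As I read it, once the fast loop has assigned stable IDs, a scripted action is just a local lookup plus a synthetic input event, with no LLM or tokenisation involved. A purely illustrative sketch (none of these names are the actual tool's internals):

```python
# Illustrative only: a local map of stable control IDs lets scripted actions
# bypass both the LLM round trip and the UI-tokenisation pass.
control_map = {
    # stable id       -> last known screen target (kept fresh by the fast perception loop)
    "btn_save":       {"x": 912, "y": 48},
    "field_filename": {"x": 400, "y": 120},
}

def send_native_click(x: int, y: int) -> None:
    """Stand-in for an OS-level input event (platform-specific in reality)."""
    print(f"click at ({x}, {y})")

def click(control_id: str) -> None:
    """Fast path: resolve a stable ID locally and synthesize the click."""
    target = control_map[control_id]
    send_native_click(target["x"], target["y"])

# Roughly the shape of what `desktopctl pointer click --id btn_save` might resolve to.
click("btn_save")
```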
The only reason you wouldn't choose an API is if it wasn't viable.
In particular, the vision-based approach used in the evaluation has clear limitations with regard to efficiency due to its nature (small observation window, heterogeneous modality)
At Smooth we use a hybrid DOM/vision approach and we index very strongly on small models. An interesting fact is that UIs are generally designed to minimize ambiguity and to supply all and only the necessary context as token-efficiently as possible, and the UX is wired up to abstract the APIs into well-understood interface patterns, e.g. dropdowns or autocompletes. This makes navigation easier, and that's why small models can do it, which is another dimension that must be considered
We typically recommend using APIs/MCP where available and well designed, but it's genuinely surprising how token-efficient agentic browser navigation can actually be
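To give a feel for the kind of token efficiency being claimed (this is not Smooth's actual pipeline), one common trick is to serialize only a page's interactive elements into a compact listing before handing it to a small model, e.g. with Playwright:

```python
# Sketch: serialize only a page's interactive elements into a compact listing
# a small model can navigate. Placeholder URL; not any vendor's real pipeline.
from playwright.sync_api import sync_playwright

JS_COLLECT = """
els => els.map((el, i) => ({
    id: i,
    tag: el.tagName.toLowerCase(),
    text: (el.innerText || el.value || el.getAttribute('aria-label') || '').trim().slice(0, 40),
}))
"""

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")  # placeholder URL
    elements = page.eval_on_selector_all("a, button, input, select, textarea", JS_COLLECT)
    for el in elements:
        print(f"- [{el['id']}] <{el['tag']}> {el['text']}")  # e.g. "- [3] <button> Sign in"
    browser.close()
```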
The problem is that not everything from the 'past' can be accessed via APIs. It would be a fun time - remember Prism [1] - I would just run that and get all the API calls in a nice format and then replay them over and over to do things in succession.
In the new world, we have access to OpenAPI.json and whatnot, but for the world where things were built pre-OpenAPI, pre-specs, and pre-best-practices...I am not so sure! (and a lot of the world still lives there)
Alas, this works for a good chunk of things but not everything. Which is why the other technology exists.
I think OpenAI designing their own phone is the next logical step. I hope they succeed, which should bring major competition to Apple and Android.
I want to be able to control both Mac apps and the browser. I also need it to figure out things by itself given a goal.
Claude Code with the --chrome flag is kind of good, but it's too slow. I wanted to try faster APIs, like the one hosted on Cerebras, but it's too expensive.
Any solution I might be missing?
To me the browser is a translation layer. Working on the browser directly, while hard, enables big advantages in compatibility. The only thing I miss as of now, which is on the todo list, is OCR of images in the browser into text output. But an API would need to do that anyway to work.
The main loss, in my view, of a pure API-based approach is: where do you get the data? We won't replicate human work without seeing it done. Humans work in the UI; that's it. Computer use, to me, is the promise of being able to replicate end-to-end the actions a human does. An API can do that in theory, but the data to do it is also near impossible to collect properly.
idk.. not really thought out too much, but has to be better
Recently, I asked Claude to log into my local grocery store chain's website and add all of the items from my shopping list to a cart. It was hilariously slow, but it did get the job done.
Unless I missed it, the article doesn't explicitly mention speed in the copy, but the results do show a 17 minute (!!!) total time for the vision agent vs. 0.5s - 2.8s for the API approach.
A big part of the challenge with vision is that to manipulate the DOM, you first have to be sure the entire (current) DOM is loaded. In my experience this ends up adding a lot of artificial waits for certain elements to exist on the page.
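In Playwright terms, the usual mitigation is an explicit wait on the element or on network idle before acting; a minimal sketch with placeholder URL and selectors:

```python
# Sketch: wait for the DOM to settle before acting, instead of sleeping blindly.
# The URL and selector are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/reviews")
    page.wait_for_load_state("networkidle")    # wait for in-flight requests to finish
    page.wait_for_selector(".review-summary")  # wait for the element to actually exist
    page.click(".review-summary >> nth=0")     # only now is it safe to interact
    browser.close()
```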
Had we not made that wrong turn, LLMs and humans would have a much easier time reasoning about APIs they don't directly control.
> To make the comparison apples-to-apples, we rewrote the vision prompt as an explicit UI walkthrough, naming the sidebar items, tabs, and form fields the agent should interact with at each step. Fourteen numbered instructions covering the navigation the agent had failed to figure out on its own.
This is a model problem, though. Because the model failed to understand it could scroll, you forced it to consume multiples of the tokens. Could you come up with an alternative here?
Do you know what the vision model was trained on? Because often people see "vision model" and think "human-level GUI navigator" when afaik the latter has yet to be built.
I can see the appeal of the pixel route given its universality, but wow, that seems ugly on efficiency
There are use cases where the vision agent is the more obvious, or only, choice though, e.g. proprietary/locked-down desktop apps that lack an automation layer.
Try playing Fruit Ninja via text and LLM tool calls, though
I don't think any new app should ever be specifically designed for AI to interact with it through computer use
Tokens are a resource and should be managed as such.
I embedded a Google Calendar widget on my "Book a demo" page; I don't know of an API for it, and Google doesn't expose/maintain one either.
What we are doing at Retriever AI is instead reverse-engineering the website's APIs on the fly and calling them directly from within the webpage so that auth/session tokens propagate for free: https://www.rtrvr.ai/blog/ai-subroutines-zero-token-determin...
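The general trick (sketched here with Playwright's `page.evaluate`, which is not necessarily Retriever's actual implementation) is to issue the request from inside the page's own JS context, so the browser attaches the existing cookies and session headers itself:

```python
# Sketch: call a site's own API from inside the page context so existing
# cookies/session tokens propagate for free. The endpoint and URL are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://app.example.com/dashboard")  # assume an already-authenticated session

    orders = page.evaluate("""
        async () => {
            // Runs in the page, so the browser attaches the session cookies itself.
            const resp = await fetch('/api/orders?limit=20', { credentials: 'include' });
            return resp.json();
        }
    """)
    print(orders)
    browser.close()
```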
IMO, this is the argument for doing work in the first place.
It seems like you'd need a deliberately hostile app before a vision agent would even be considered as an option.
The problem is, all the tasks are essentially: a) things agents probably just can't do, and b) things that absolutely cannot afford to be hallucinated or otherwise fucked up. So far the tasks I've thought of:
- Taxes. So it needs a lot of sensitive information to get W2's. Since I have to look up a lot of this stuff in the physical world anyway, it's not like I can just let it run wild.
- Background check for a new job. It took me 3 hrs to fill out one of them (mostly because the website was THAT bad). Being myself, I already was making mistakes just forgetting things like move in dates from 10 years ago, and having to do a lot of searching in my email for random documents. No way I'm trusting an agent with this.
- Setting up an LLC. Nope nope nope. There's a lot of annoying work involved with this, but I'm not trusting an LLM to do this.
Anyway, I guess my point is that even if an LLM was good at using my computer (so far, it seems like it wouldn't be), the kind of things I'd want an agent for are things that an LLM can't be trusted with.
Electron uses 10x more RAM than regular apps. But it's so convenient.
Python is 100x slower than C. It's in the top 3 languages now.
Worse but more convenient always wins.
When you think of everything it takes for an AI to use what the article calls a "vision agent" then it seems as if using a purpose-made API ought to be MANY orders of magnitude faster.
For example, to automate processing emails, it needs to:
1. go to Gmail
2. log in to Google if necessary (this often requires two-step verification, so it's hard to automate completely, but possible)
3. read the latest mail
4. check the content and choose the action:
   - if needed, reply to the email
   - if it mentions tasks, add them to the todo list
   - if it mentions schedules, add them to the calendar
5. repeat for all emails based on specified conditions

Each step requires dozens of DOM (a11y tree) analyses and actions (fill the username/password inputs, check "keep me logged in", click the submit button, etc.), and depending on the model used, each one can take ~100s. So easy tasks can easily add up to tens of minutes or even hours.
For frequently used tasks, I write skills like /logging-in and /read-latest-emails using Playwright scripts and let the agent choose them. Based on the email content, the agent then chooses other tools like /write-reply, /add-todo, /add-event, etc., so that the model can focus only on the core tasks that require thinking. It reduces the execution time drastically.
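For what it's worth, such a skill usually ends up as a short deterministic script. A minimal sketch of a /read-latest-emails-style skill using Playwright's sync Python API (the inbox URL, selectors, and auth file are placeholders, not any real webmail markup):

```python
# Sketch of a "/read-latest-emails"-style skill: deterministic navigation the
# agent no longer has to think about. URL, selectors, and auth file are placeholders.
from playwright.sync_api import sync_playwright

def read_latest_emails(max_emails: int = 5) -> list[dict]:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(storage_state="auth.json")  # reuse a saved login session
        page.goto("https://mail.example.com/inbox")
        page.wait_for_selector(".email-row")

        emails = []
        for row in page.locator(".email-row").all()[:max_emails]:
            emails.append({
                "sender": row.locator(".sender").inner_text(),
                "subject": row.locator(".subject").inner_text(),
                "snippet": row.locator(".snippet").inner_text(),
            })
        browser.close()
        return emails

if __name__ == "__main__":
    for mail in read_latest_emails():
        print(mail["sender"], "-", mail["subject"])  # the agent only reasons over this output
```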
But this can bury important business logic in the Playwright scripts rather than in the agent's instructions. For example, simplified steps to add TODO items look like:
1. read the email
2. check if it's about todos, then decide to add them to Asana
3. extract and summarize the title, content, priority, due date, tags, etc.
4. access Asana (log in if necessary)
5. check if there are similar tasks
6. if not, add the tasks

This can take tens of minutes, and each step can carry important business logic, like:
- how to decide the priority and due date
- how to choose tags based on the content
- how to decide whether two tasks are similar

This information should be readable and updatable not only by developers, but also by managers and other teams. If I write those steps as skills with Playwright scripts, it improves the speed, but all that business logic gets buried in the code, inaccessible to non-technical people. It's also error-prone, because websites often tweak their UI and the scripts stop working.
So it's very convenient if the agent processes these steps once, then decides it's worth writing the Playwright script so that next time those mundane processes can be executed instantly.
With automatic skill generation, the agent decides by itself whether there are workflows worth turning into skills backed by Playwright scripts, like /log-in, /extract-information, /check-similar-tasks, /add-tasks. Like a Just-In-Time compiler, the skills are a byproduct of the agent's instructions: all business logic stays written in the agent's instructions, and the scripts don't need to be updated manually or tracked in a version control system.
This can reduce a lot of execution time and API cost, and it can be applied beyond browser automation, to computer use or any other agentic task where it's possible to write automation scripts for the steps that don't require thinking.
Me: hmm, this title confuses and infuriates Rob.
[Clicks link]
Me: Sees same title, repeat feelings of confusion and infuriation
[Scrolls article down on my smartphone]
Me: Sees jpg with the same title, repeat feelings of confusion and infuriation.
[Closes tab]
[Continues living rest of my life]
I hope this feedback is well received and understood.
Using CLI tools is much faster and more token-efficient. I developed ten apps in the last two months. One reached 10,000+ monthly active users.
I ask Codex to generate SVGs line by line and backtrack-edit them, ask it to use Inkscape to generate icons, etc...
I developed all of this on a $20 Codex sub.
The EHRs could give companies like Akasa API access, so Akasa could just run NLP. But the EHR vendors don't grant third parties API access, for various reasons. So instead Akasa gets a seat license for each medical system it services, uses computer vision to read the screen (a cadre of Akasa medical coders reviews errors to stay up to date with unannounced changes from the EHR vendors), and then runs the NLP to figure out which CPT codes to assign so it can actually put together a bill and send it to the payer, so the hospitals can stay afloat.
So this 45x delta is how much more the medical systems pay Akasa because Epic won't work with Akasa.
This is but one example of why US medical bills are outrageously high.