The First Fully General Computer Action Model

168 points - last Monday at 5:00 PM

Comments

nee1r last Monday at 5:10 PM
Hey guys! I’m Neel. I’ve been holed up in our South Park office for the past year working on model training. Excited to share our research!

This is a preview of a very different type of computer-use model: we train on the internet. Specifically, we have 11 million hours of computer video on our storage cluster (previously shared at https://news.ycombinator.com/item?id=45438496 !) and the model runs at 30 FPS. Since we match the fundamental form factor of computer use, we can get our model to do CAD, browse websites, and even drive a car using arrow keys. I’m super excited to see what our model can do as we scale further; it's a fun frontier to work on (not language models :) ).

The team and I will be online responding to the comments, so drop any questions.

mcint today at 1:25 AM
Congratulations! I’ll be interested to see the next steps in alignment. Do you plan to start selling access, or to collect more data and train bigger and better models? What tasks or benchmarks are your biggest guiding stars, and what was unexpectedly tricky? A few are hinted at in the post.

It would be pretty interesting to see activation maps for the encoder on video; it would be confidence-building to see the compression derived from so much training.

clemvonstengel last Monday at 5:13 PM
I really liked the point about ctrl-c only being labellable retrocausally. I do think that with enough past context you should be able to know what was copied (in some sense the past does encode the future), but an agentic decision is precisely the kind where the future is more informative than the past for reconstructing it.

It does make me wonder if you should split the inverse dynamics model into specifically retrocausal and causal parts. You kind of do this already with the inverse and forward dynamics models, but the idea of a model that knows only about the future training in a feedback loop with a model that knows only about the past is intriguing.

I think you could just do a clever masking regime in your diffusion model to achieve the same effect without a whole architecture change.
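To make the masking idea concrete, here is a minimal sketch (all names are mine, not from the post) of how a causal/retrocausal split could come from attention masks alone, with no architecture change:

```python
# Hypothetical sketch: give an inverse dynamics model a causal branch
# (sees only past frames) and a retrocausal branch (sees only future
# frames) by varying the attention mask per training example.

T = 8  # frames in the context window

def causal_mask(T):
    # action at step i may attend only to frames j <= i (the past)
    return [[j <= i for j in range(T)] for i in range(T)]

def retrocausal_mask(T):
    # action at step i may attend only to frames j > i (the future)
    return [[j > i for j in range(T)] for i in range(T)]

# The two masks partition the full non-causal context, so sampling one
# branch per example trains both conditionals in a single network.
full = [[c or r for c, r in zip(cr, rr)]
        for cr, rr in zip(causal_mask(T), retrocausal_mask(T))]
assert all(all(row) for row in full)
```

Whether this matches what the authors' masked diffusion setup already does internally, I can't say; it's just one way to get both conditionals from one network.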

kylenessen today at 12:09 AM
This seems like really great research, and the first time I’ve seen overwhelming praise on HN. Congrats!

I wanted to comment, though, that your title is not doing you any favors, and I suspect that is why this is not getting more traction (which it deserves). I fully expected some half-baked GitHub repo, but instead found something truly awesome.

To use your own words, Neel, “a very different type of computer use model” would have had me clicking faster. I’m not great at titles, however, and maybe there are better ideas out there.

Anyway, can’t wait to see how this develops! Especially looking forward to the CAD work.

lambdaloop today at 2:42 AM
This is fascinating! Having a really strong video encoder model and then a simpler decoder from that reminds me of the recent D4RT from DeepMind as well: https://d4rt-paper.github.io/

I think we'll see more of these video encoder models in the coming years; they truly seem like magic.

segmondy today at 2:36 AM
Nice, I have always felt the computer was the ultimate environment and screen capture the ultimate training data. It's great to see it in practice; now we have to wait and see whether folks will argue about whether your model could really learn a world model. I'm surprised this post doesn't have more comments; their site is worth checking out. Rooting for them. They're gritty: check out their storage buildout story.

cs702 yesterday at 10:37 PM
At first glance, this looks incredible to me. The authors train one model on 40K hours of computer-use video, previously labeled by contractors with keyboard and mouse actions, then use that model, in effect, to label 11M hours of computer-use video, which they use to train the computer-action model. The key advance is in compression. Quoting from the OP:

> [previous models] burn a million tokens to understand just one minute of 30 FPS computer data. Our video encoder encodes nearly 2 hours of video in the same number of tokens—that’s 50x more token-efficient than the previous state-of-the-art and 100x more token-efficient than OpenAI’s encoder.
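The quoted numbers are easy to sanity-check with some back-of-the-envelope arithmetic, assuming a flat two hours at 30 FPS and the same 1M-token budget (the exact ratio below is my calculation, not a figure from the post):

```python
# Back-of-the-envelope check of the compression claim, using the numbers
# quoted above: ~1M tokens per minute of 30 FPS video for previous models
# vs. nearly 2 hours of video in the same ~1M tokens for this encoder.

TOKENS = 1_000_000
FPS = 30

old_seconds = 60          # one minute
new_seconds = 2 * 3600    # "nearly 2 hours", taken as a flat 2 hours

old_tokens_per_frame = TOKENS / (old_seconds * FPS)   # ~555 tokens/frame
new_tokens_per_frame = TOKENS / (new_seconds * FPS)   # ~4.6 tokens/frame

ratio = old_tokens_per_frame / new_tokens_per_frame
print(round(old_tokens_per_frame, 1), round(new_tokens_per_frame, 1), round(ratio))
# ~120x vs. the million-token baseline; the "50x over previous SOTA" figure
# presumably compares against a better baseline than that one.
```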

While I was already aware that there are people working on new, more efficient "world models," this is the first one I've seen in action. I'm a bit in shock at how good it is, quite frankly.

I've added the OP, as well as a related 2018 paper on Behavioral Cloning from Observation (BCO), to my reading list.[a] So far, I've only skimmed the 2018 paper, but it's already evident that it's well written. I'm no expert in deep RL, and I can understand it. BTW, "Behavioral Cloning from Observation" is a really good name, with an easy-to-remember acronym.

Thank you for sharing this on HN.

[a] https://arxiv.org/abs/1805.01954
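The pipeline described above can be sketched as a toy loop. Everything here is a stand-in (frames are integers, the "IDM" is an exactly recoverable rule), so this only illustrates the shape of the method, not the authors' implementation:

```python
# Toy illustration of the BCO-style pipeline: learn an inverse dynamics
# model (IDM) on a small labeled set, use it to pseudo-label a large
# unlabeled corpus, then train the action model on the pseudo-labels.

# "Frames" are positions; the hidden action is the step between them.
labeled = [(0, 1, 1), (1, 3, 2), (3, 2, -1)]   # (frame_t, frame_t1, action)

# Step 1: "train" the IDM. In this toy world the rule is exactly
# recoverable: action = frame_t1 - frame_t. A real IDM is a learned network.
def idm(frame_t, frame_t1):
    return frame_t1 - frame_t

# sanity-check the IDM against the labeled data
assert all(idm(a, b) == act for a, b, act in labeled)

# Step 2: pseudo-label a much larger unlabeled corpus of frame pairs.
unlabeled = [(5, 7), (7, 7), (7, 4)]
pseudo_labeled = [(a, b, idm(a, b)) for a, b in unlabeled]

# Step 3: the action model is then trained on pseudo_labeled, which is far
# larger than the hand-labeled set (11M vs. 40K hours in the post).
print(pseudo_labeled)   # [(5, 7, 2), (7, 7, 0), (7, 4, -3)]
```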

alyxya last Tuesday at 2:24 AM
This looks extremely impressive, really deserves more attention here.

Are the inverse dynamics and forward dynamics models trained separately? If the inverse dynamics model is mainly there to extrapolate more training data, perhaps that just means it takes very little data for the forward dynamics model to generalize directly, given the right architecture.

theredsix today at 1:59 AM
This is one of those Hacker News posts that you stumble upon and see two genius ideas within the span of as many paragraphs. Thanks again for sharing the diffusion-based labeling algorithm. It truly demonstrates a mastery of what diffusion is capable of.

nextzck today at 12:42 AM
I think you guys are on the right track here. I’d love to learn more about the math behind the FDM. I don’t think folks realize how far behind we are on vision. Thank you for your work here.

ripped_britches today at 12:24 AM
Looks extremely impressive! Genuine question: why are you sharing your methods openly? I am grateful for it, but curious about your motivations.

vessenes yesterday at 11:45 PM
dammmmmmnnnn - lots to like here. I'm impressed by the 80,000 parallel website-fuzzing desktops. And the 30 Hz everything. Amazing.

aakashks last Monday at 6:42 PM
The video compression is very cool. And the small tricks like binning the mouse movements.

I wonder how much of the data generalizes across different UIs, i.e. how good will the model be at using Figma if it’s never seen it before but has seen a lot of Photoshop?
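For anyone wondering what "binning the mouse movements" might mean in practice, here is one guess (the bin edges and function names are illustrative, not the authors' scheme): quantize continuous (dx, dy) deltas into a small discrete vocabulary so the model predicts bin indices instead of raw coordinates.

```python
import bisect

# Non-uniform edges: fine resolution for small moves, coarse for large ones.
EDGES = [-512, -128, -32, -8, -2, 0, 2, 8, 32, 128, 512]

def bin_delta(d):
    # map a raw pixel delta to an integer bin id in [0, len(EDGES)]
    return bisect.bisect_left(EDGES, d)

def tokenize_move(dx, dy):
    # one token per axis; a real tokenizer might pair them into one token
    return (bin_delta(dx), bin_delta(dy))

print(tokenize_move(1, -200))
```

The payoff of non-uniform bins is that small, precise moves keep fine resolution while big flicks collapse into a few coarse tokens, keeping the action vocabulary tiny.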

rio_popper last Monday at 5:06 PM
Curious about the masked diffusion IDM choice. They mention that CTC loss and cross-entropy both underperformed; I'd love to see ablations on that. The claim that typos were "extremely common" with non-causal cross-entropy is interesting but hand-wavy without numbers.

piva00 yesterday at 10:27 PM
Just wanted to say: this is mighty impressive research.

Really interesting breakdown; I got properly nerdsniped by this. Thanks for the refreshing AI news outside of language models :)

sp1nningaway yesterday at 10:28 PM
May I suggest a driving demo in a parking lot with a mannequin, instead of a real-world video where it drives way too close to a pedestrian?

Otherwise, very cool and exciting!

ennucore last Monday at 5:06 PM
The car thing is very impressive! By the way, do you have plans to handle the computer’s audio output?

user- today at 12:34 AM
Really really cool. I appreciate the article style a lot too.

ClaireBookworm last Monday at 5:23 PM
What sort of fine-tuning data was needed to allow the model to self-drive? One hour of video of someone driving, or extra labeling?

wasmainiac yesterday at 10:28 PM
Can it defeat captchas?

kdrag0n last Monday at 5:34 PM
What tasks can the model do out of the box? Was each of the examples a different fine-tuned model?

bananzamba today at 12:20 AM
Very impressive stuff!

Can you prompt it or is it strictly Copilot-style prediction?

LorenDB yesterday at 11:42 PM
Nice that it can drive a car, but you could just use openpilot.

ennucore last Monday at 5:08 PM
How do you tokenize the mouse inputs?

bitwize yesterday at 11:30 PM
Looks like it's playing the special stages from Knuckles' Chaotix?

152334H last Tuesday at 2:22 PM
holy crap, this is so good. How did it get buried?

Obscura- yesterday at 10:01 PM
Amazing!

akoboldfrying yesterday at 10:50 PM
My tech-informed but ML-ignorant take: This will soon be the biggest thing since ChatGPT.