Hey guys! I’m Neel, been holed up in our South Park office for the past year working on model training. Excited to share our research!
This is a preview of a very different type of computer use model: we train on the internet. Specifically, we have 11 million hours of computer video stored on our storage cluster (previously shared: https://news.ycombinator.com/item?id=45438496) and the model can run at 30 FPS. Since we match the fundamental form factor of computer use, we can get our model to do CAD, browse websites, and even drive a car using arrow keys. I’m super excited to see what our model can do as we scale more. It's a fun frontier to work on (not language models :) ).
The team and I will be online responding to the comments, so drop any questions.
mcint today at 1:25 AM
Congratulations! I’ll be interested to see the next steps in alignment. Do you plan to start selling access, or collect more data to train bigger & better? What tasks or benchmarks are your biggest guide stars, or what was unexpectedly tricky—a few are hinted in the post.
It would be pretty interesting to see activation maps for the encoder on video; it would build confidence to see the compression derived from so much training.
clemvonstengel last Monday at 5:13 PM
I rly liked the point about ctrl-c only being able to be labelled retrocausally. I do think that with enough past context you should be able to know what was copied - in some sense the past does encode the future - but also an agentic decision is precisely the kind where the future is more informative than the past for reconstructing that decision.
It does make me wonder if you should have the inverse dynamics model split into specifically retrocausal and causal. You kind of do this already with the inverse and forward dynamics model, but the idea of a model that knows only about the future training in a feedback loop with a model that knows only about the past is kind of interesting.
I think you could just do a clever masking regime in your diffusion model to achieve the same effect without a whole architecture change.
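To make the masking idea concrete, here is a minimal sketch of what "causal" and "retrocausal" visibility masks over a sequence of frames could look like. The function name, the mode names, and the choice to let a position see its own frame are all assumptions for illustration; nothing here is taken from the OP's diffusion model.

```python
def visibility_mask(n_frames, mode):
    """Return an n_frames x n_frames matrix of booleans.

    mask[i][j] is True when position i is allowed to look at frame j.
    Hypothetical modes (not from the OP):
      "causal"      -> past and present frames only (j <= i)
      "retrocausal" -> present and future frames only (j >= i)
      "full"        -> every frame
    """
    if mode not in ("causal", "retrocausal", "full"):
        raise ValueError(f"unknown mode: {mode}")
    mask = []
    for i in range(n_frames):
        row = []
        for j in range(n_frames):
            if mode == "causal":
                row.append(j <= i)
            elif mode == "retrocausal":
                row.append(j >= i)
            else:
                row.append(True)
        mask.append(row)
    return mask


# Cycling through the modes during training would expose one network to
# past-only, future-only, and full context without changing the architecture.
for mode in ("causal", "retrocausal", "full"):
    print(mode, visibility_mask(3, mode))
```

Whether a single network trained this way matches two separate models is an open question; the sketch only shows that the split can be expressed as a masking choice.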
kylenessen today at 12:09 AM
This seems like really great research, and the first time I’ve seen overwhelming praise on HN. Congrats!
I wanted to comment though that your title is not doing you any favors, and I suspect that is why this is not getting more traction (which it deserves). I fully expected some half baked GitHub repo, but instead found something truly awesome.
To use your own words, Neel, “a very different type of computer use model” would have had me clicking faster. I’m not great at titles, however, and maybe there are better ideas out there.
Anyway, can’t wait to see how this develops! Especially looking forward to the CAD work.
lambdaloop today at 2:42 AM
This is fascinating! Having a really strong video encoder model and then a simpler decoder from that reminds me of the recent D4RT from DeepMind as well: https://d4rt-paper.github.io/
I think we'll see more of these video encoder models in the coming years, they truly seem like magic.
segmondy today at 2:36 AM
Nice, I have always felt the computer was the ultimate environment and screen capture the ultimate training data. Nice to see it in practice. Now we have to wait to see if folks are going to argue about whether your model could really learn a world model. I'm surprised this post doesn't have more comments; their site is worth checking out. Rooting for them, they are gritty. Check out their storage buildout story.
cs702 yesterday at 10:37 PM
At first glance, this looks incredible to me. The authors train one model on 40K hours of computer-use video, previously labeled by contractors with keyboard and mouse actions, then use that model, in effect, to label 11M hours of computer-use video, which they use to train the computer-action model. The key advance is in compression. Quoting from the OP:
> [previous models] burn a million tokens to understand just one minute of 30 FPS computer data. Our video encoder encodes nearly 2 hours of video in the same number of tokens—that’s 50x more token-efficient than the previous state-of-the-art and 100x more token-efficient than OpenAI’s encoder.
While I was already aware that there are people working on new, more efficient "world models," this is the first one I've seen in action. I'm a bit in shock at how good it is, quite frankly.
I've added the OP, as well as a related 2018 paper on Behavioral Cloning from Observation (BCO) to my reading list.[a] So far, I've only skimmed the 2018 paper, but it's already evident that it's well-written. I'm no expert in deep RL, and I can understand it. BTW, "Behavioral Cloning from Observation" is a really good name, with an easy-to-remember acronym.
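For anyone who wants to sanity-check the token numbers in the quote, here is a back-of-the-envelope calculation. It assumes "a million tokens" means exactly 1,000,000 and rounds "nearly 2 hours" up to 120 minutes, so the result is an upper bound. The quoted 50x and 100x figures compare against specific encoders whose token rates aren't given here, so they are not reproduced.

```python
FPS = 30
TOKEN_BUDGET = 1_000_000   # "a million tokens", taken literally
BASELINE_MINUTES = 1       # what the quoted baseline covers with that budget
ENCODER_MINUTES = 120      # "nearly 2 hours", rounded up (assumption)


def tokens_per_frame(tokens, minutes, fps=FPS):
    """Average number of tokens spent on each video frame."""
    return tokens / (minutes * 60 * fps)


baseline = tokens_per_frame(TOKEN_BUDGET, BASELINE_MINUTES)
encoder = tokens_per_frame(TOKEN_BUDGET, ENCODER_MINUTES)
ratio = baseline / encoder

print(f"baseline: {baseline:.1f} tokens/frame")
print(f"encoder:  {encoder:.2f} tokens/frame")
print(f"ratio:    {ratio:.0f}x")
```

Under those assumptions the baseline spends roughly 556 tokens per frame and the new encoder fewer than 5.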
This looks extremely impressive, really deserves more attention here.
Are the inverse dynamics and forward dynamics models trained separately? It sounds like if the inverse dynamics model is meant to extrapolate more training data, then perhaps all that means is it takes very little data to generalize directly with the forward dynamics model assuming the right architecture.
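For context on the question, here is a toy sketch of the two-stage idea as described in this thread: fit an inverse dynamics model (IDM) on a small labeled set, use it to pseudo-label unlabeled transitions, then train an action model on the result. Everything here is a stand-in: states are integers, both "models" are lookup tables, and the second stage is plain behavioral cloning, not the OP's forward dynamics model. Whether the OP trains the two stages separately is not answered by this sketch.

```python
from collections import Counter, defaultdict


def train_idm(labeled):
    """labeled: (state, next_state, action) triples.
    Learns the most common action for each observed state change."""
    votes = defaultdict(Counter)
    for state, next_state, action in labeled:
        votes[next_state - state][action] += 1
    return {delta: c.most_common(1)[0][0] for delta, c in votes.items()}


def pseudo_label(idm, unlabeled):
    """unlabeled: (state, next_state) pairs with no action recorded.
    Transitions the IDM has never seen are dropped, not guessed."""
    return [(s, idm[s2 - s]) for s, s2 in unlabeled if (s2 - s) in idm]


def train_policy(pairs):
    """pairs: (state, action). Behavioral cloning as a majority-vote table."""
    votes = defaultdict(Counter)
    for state, action in pairs:
        votes[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}


# Small labeled set (the "contractor" data in this toy).
labeled = [(0, 1, "right"), (5, 4, "left"), (2, 2, "noop")]
# Larger unlabeled set: someone walking toward position 3 and stopping.
unlabeled = [(0, 1), (1, 2), (2, 3), (3, 3), (5, 4), (4, 3)]

idm = train_idm(labeled)
policy = train_policy(pseudo_label(idm, unlabeled))
print(policy)
```

In this toy the labeled set only has to cover the three kinds of state change, while the unlabeled set supplies the states the policy actually learns from.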
theredsix today at 1:59 AM
This is one of those hacker news posts that you stumble upon and see 2 genius ideas within the span of as many paragraphs. Thanks again for sharing the diffusion based labeling algorithm. Truly demonstrates a mastery and understanding of what diffusion is capable of.
nextzck today at 12:42 AM
I think you guys are on the right track here. I’d love to learn more about the math behind the FDM. I don’t think folks realize how behind we are on vision, thank you for your work here.
ripped_britches today at 12:24 AM
Looks extremely impressive! Genuine question - why are you sharing your methods openly? I am grateful for it, but just curious about your motivations.
vessenes yesterday at 11:45 PM
dammmmmmnnnn - lots to like here. I'm impressed with the 80,000 parallel website fuzzing desktops. And the 30hz (everything). Amazing.
aakashks last Monday at 6:42 PM
The video compression is very cool. And the small tricks like binning the mouse movements.
Wonder how much data is generalizable across different UIs? i.e. how good will the model be at using Figma if it’s never seen it before but has seen a lot of Photoshop?
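For anyone curious what "binning the mouse movements" could look like, here is a minimal sketch. The bin edges, the log-style spacing, and the way the two axes are packed into one token id are all assumptions for illustration, not the OP's actual scheme.

```python
import bisect

# Hypothetical bin edges in pixels, roughly log-spaced so that small,
# precise movements get finer resolution than large sweeps.
EDGES = [1, 2, 4, 8, 16, 32, 64, 128]


def bin_axis(delta):
    """Map a signed pixel delta to a signed bin index.

    0 means no movement; the sign is the direction; the magnitude grows
    with distance and saturates at len(EDGES) for very large moves.
    """
    magnitude = bisect.bisect_right(EDGES, abs(delta))
    return magnitude if delta >= 0 else -magnitude


def mouse_token(dx, dy):
    """Pack the two per-axis bins into a single non-negative token id."""
    bins_per_axis = 2 * len(EDGES) + 1
    x = bin_axis(dx) + len(EDGES)
    y = bin_axis(dy) + len(EDGES)
    return x * bins_per_axis + y


print(bin_axis(3), bin_axis(-1000), mouse_token(0, 0))
```

With these edges there are 17 bins per axis and 289 possible mouse tokens per frame, which keeps the action vocabulary small compared to raw pixel deltas.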
rio_popper last Monday at 5:06 PM
Curious about the masked diffusion IDM choice. They mention CTC loss and cross-entropy both underperformed — I'd love to see ablations on that. The claim that typos were "extremely common" with non-causal cross-entropy is interesting but hand-wavy without numbers.
piva00 yesterday at 10:27 PM
Just wanted to say: this is mighty impressive research.
Really interesting breakdown, proper nerdsniped into this, thanks for the refreshing AI news outside of language models :)
sp1nningaway yesterday at 10:28 PM
May I suggest a driving demo in a parking lot with a mannequin instead of a real world video where it drives way too close to a pedestrian?
Otherwise, very cool and exciting!
ennucore last Monday at 5:06 PM
The car thing is very impressive
By the way, do you have plans to handle the computer’s audio output?
user- today at 12:34 AM
Really really cool. I appreciate the article style a lot too.
ClaireBookworm last Monday at 5:23 PM
What sort of fine tuning data was needed to allow the model to self-drive? One hour of video of someone driving, or extra labeling?
wasmainiac yesterday at 10:28 PM
Can it defeat captchas?
kdrag0n last Monday at 5:34 PM
what tasks can the model do out of the box? was each of the examples a different fine tuned model?
bananzamba today at 12:20 AM
Very impressive stuff!
Can you prompt it or is it strictly Copilot-style prediction?
LorenDB yesterday at 11:42 PM
Nice that it can drive a car, but you could just use openpilot.
ennucore last Monday at 5:08 PM
How do you tokenize the mouse inputs?
bitwize yesterday at 11:30 PM
Looks like it's playing the special stages from Knuckles' Chaotix?
152334H last Tuesday at 2:22 PM
holy crap, this is so good. How did it get buried?
Obscura- yesterday at 10:01 PM
Amazing!
akoboldfrying yesterday at 10:50 PM
My tech-informed but ML-ignorant take: This will soon be the biggest thing since ChatGPT.