Gemini Robotics-ER 1.6

183 points - today at 2:02 PM

Comments

sho_hn today at 2:27 PM

It does all start to feel like we'd get fairly close to being able to convincingly emulate a lot of human or at least animal behavior on top of the existing generative stack, by using brain-like orchestration patterns ... if only inference was fast enough to do much more of it.

The gauge-reading example here is great, but in reality of course having the system synthesize that Python script, run the CV tasks, come back with the answer etc. is currently quite slow.
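To give a sense of what the last step of such a synthesized script might boil down to, here is a minimal sketch. It assumes an earlier CV step (not shown) has already estimated the needle angle, and that the gauge scale is linear. The function name and calibration parameters are hypothetical, not taken from the article.

```python
def needle_angle_to_value(angle_deg, min_angle_deg, max_angle_deg, min_value, max_value):
    """Linearly map a needle angle to a gauge reading (assumes a linear scale)."""
    span = max_angle_deg - min_angle_deg
    if span == 0:
        raise ValueError("min and max angles must differ")
    fraction = (angle_deg - min_angle_deg) / span
    # Clamp so a slightly over- or undershooting angle estimate stays on the scale.
    fraction = max(0.0, min(1.0, fraction))
    return min_value + fraction * (max_value - min_value)
```

The arithmetic itself is trivial; the slow part in the demo is presumably everything around it (generating the script, running detection, round-tripping to the model).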

Once things go much faster, you can also start to use image generation to have models extrapolate possible futures from photos they take, and then describe them back to themselves and make decisions based on that, loops like this. I think the assumption is that our brains do similar things unconsciously, before we integrate into our conscious conception of mind.

I'm really curious what things we could build if we had 100x or 1000x inference throughput.

shireboy today at 7:49 PM

Maybe dumb question: One of the use cases is instrument reading of analog instruments. My brain immediately goes to "this should have some sensor sending data, and not be analog". Is having a robot dog read analog sensors really a better fit in some cases?

vibe42 today at 4:09 PM

A parcel of land.

A few robot legs and arms, big battery, off-the-shelf GPU. Solar panels.

Prompt: "Take care of all this land within its limits and grow some veggies."

harrall today at 5:00 PM

Google and Boston Dynamics (of Spot, Atlas fame) formed a partnership a while back and they’ve been working on building models together.

Hyundai now owns Boston Dynamics and is pushing to get the robots into their factories.

Isamu today at 7:12 PM

> Our safest robotics model yet
>
> Safety is integrated into every level of our embodied reasoning models. Gemini Robotics-ER 1.6 is our safest robotics model to date, demonstrating superior compliance with Gemini safety policies on adversarial spatial reasoning tasks compared to all previous generations.

The safety guidelines are interesting: they treat them as a goal they are aspiring to achieve, which seems realistic. It’s not quite ready for prime time yet.

skybrian today at 3:15 PM

Pointing a camera at a pressure gauge and recording a graph is something that I would have found useful and have thought about writing. Does software like that exist that’s available to consumers?
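The recording half of that idea is small enough to sketch. This is a hedged illustration only: `read_gauge` is a hypothetical callable standing in for whatever camera-plus-vision step produces a number, and the CSV layout is an assumption.

```python
import csv
import time

def log_readings(read_gauge, out_file, samples, interval_s=0.0, clock=time.time):
    """Call read_gauge() `samples` times and write (timestamp, value) rows as CSV."""
    writer = csv.writer(out_file)
    writer.writerow(["timestamp", "value"])
    for i in range(samples):
        writer.writerow([clock(), read_gauge()])
        # Sleep between samples, but not after the last one.
        if interval_s > 0 and i < samples - 1:
            time.sleep(interval_s)
```

The resulting file can be graphed with any spreadsheet or plotting tool; the hard part remains the gauge-reading function itself.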

colinator today at 5:43 PM

This seems perfect to hook up to my 'LLMs can control robots over MCP' system. The idea is that LLMs are great at writing code, so let's lean into that. I'll give it a try! I just got a bigger robot, we'll see how it does...

https://colinator.github.io/Ariel/post1.html

gallerdude today at 2:50 PM

I’ve been thinking about AI robotics lately… if internally at labs they have a GPT-2, GPT-3 “equivalent” for robotics, you can’t really release that. If a robot unloading your dishwasher breaks one of your dishes once, this is a massive failure.

So there might be awesome progress behind the scenes, just not ready for the general public.

Lucasoato today at 7:21 PM

Meanwhile, Gemini 3.1 Pro (which was released two months ago) was completely unavailable to me this afternoon, through neither the API nor the subscription.

Nothing was reported on the Google status page, and not even the CLI is responding: it just sits there waiting for an answer that never arrives, even after 10 minutes.

vessenes today at 4:08 PM

Nice. I couldn't find the part that I'm most interested in, though: latency. This beats their frontier vision model for some identification tasks, but for a robotics model, I'm interested in Hz. Since this is an "Embodied Reasoning" model, I'm assuming it's fairly slow and designed to pair with faster-cycle on-robot models.

Anyway, cool.

w10-1 today at 5:25 PM

Would this approach destroy critical investments in physics- or modeling-based reasoning?

I'm all for the task reasoning and the multi-view recognition, based on relevant points. I'm very uncomfortable with the loose world "understanding".

The fault model I see is that e.g., this "visual understanding" will get things mostly right: enough to build and even deliver products. However, these are only probabilistic guarantees based on training sets, and those are unlikely to survive contact with a complex interactive world, particularly since robots are often repurposed as tasks change.

So it's a kind of moral-product-hazard: it delivers initial results but delays risk to later, so product developers will have incentives to build and leave users holding the bag. (Indeed: users are responsible for integration risks anyway.)

It hacks our assumptions: we think that you can take an MVP and productize it, but in this case, you'll never backfit the model to conform to the physics in a reliable way. I doubt there's any way to harness Gemini to depend on a physics model, so we'll end up with mostly-working sunk investments out in the market - slop robots so cheap that tight ones can't survive.

mt_ today at 5:34 PM

Is there an open-source mini robot kit that lets me play around with agentic robots?

steveharing1 today at 5:41 PM

Soon Open Source will fill the gap here as well

martythemaniak today at 6:21 PM

As the article notes, regular Gemini and Gemma also have spatial reasoning capabilities. I decided to test this by seeing whether Gemini could drive a little rover successfully, which it mostly did: https://martin.drashkov.com/2026/02/letting-gemini-drive-my-...

LLMs are really good at the sort of tasks that have been missing from robotics: understanding, reasoning, planning, etc., so we'll likely see much more use of them in various robotics applications. I guess the main questions right now are:

- who sends the various fine-motor commands? The answer most labs/researchers have is "a smaller diffusion model": the LLM acts as a planner, then a smaller, faster diffusion model controls the actual motors. I suspect in many cases you can get away with the equivalent of a tool call: the LLM simply calls a particular subroutine, like "go forward 1m" or "tilt camera right".

- what do you do about memory? All the models are either purely reactive or take a very small slice of history and use that as part of the input, so they all need some type of memory/state management system to actually allow them to work on a task for more than a little while. It's not clear to me whether this will be standardized and become part of models themselves, or everyone will just do their own thing.
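Both points can be illustrated with a small sketch: planner output treated as tool calls dispatched to fixed subroutines, plus a bounded history that could be fed back into the next prompt. Everything here is hypothetical (the `Rover` class, the tool names, the call format); it is not any lab's actual interface, and a real system would replace the stand-in class with motor drivers.

```python
from collections import deque

class Rover:
    """Hypothetical stand-in for real motor drivers; it only records state."""
    def __init__(self):
        self.x_m = 0.0
        self.camera_pan_deg = 0.0

    def go_forward(self, meters):
        self.x_m += meters
        return f"moved forward {meters} m"

    def tilt_camera(self, degrees):
        self.camera_pan_deg += degrees
        return f"camera tilted {degrees} deg"

def run_tool_calls(rover, calls, memory):
    """Dispatch planner-issued calls to subroutines and record each outcome."""
    tools = {"go_forward": rover.go_forward, "tilt_camera": rover.tilt_camera}
    for call in calls:
        name = call["name"]
        if name not in tools:
            # Unknown tools are reported back rather than raising.
            result = f"unknown tool: {name}"
        else:
            result = tools[name](**call.get("args", {}))
        # A deque with maxlen silently drops the oldest entry when full.
        memory.append((name, result))
    return memory
```

Passing `deque(maxlen=N)` as `memory` gives the "very small slice of history" behavior described above; anything longer-lived would need a separate store, which is exactly the open question.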

jeffbee today at 2:47 PM

Showing the murder dog reading a gauge using $$$ worth of model time is kinda not an amazing demo. We already know how to read gauges with machine vision. We also know how to order digital gauges out of industrial catalogs for under $50.