A one-armed robot stood in front of a table. On the table were three plastic figurines: a lion, a whale and a dinosaur.
An engineer instructed the robot: “Pick up the extinct animal.”
The robot buzzed for a moment, then its arm extended and its claw opened and descended. It grabbed the dinosaur.
Until recently, this demonstration, which I witnessed last week during a podcast interview at Google’s robotics division in Mountain View, California, would have been impossible. Robots weren’t able to reliably manipulate objects they’d never seen before, and they certainly weren’t able to make the logical leap from “extinct animal” to “plastic dinosaur.”
But a quiet revolution is underway in robotics, one that piggybacks on recent developments in so-called large language models, the same type of artificial intelligence system that powers ChatGPT, Bard and other chatbots.
Google recently started plugging state-of-the-art language models into its robots, giving them the equivalent of an artificial brain. The mysterious project has made the robots much smarter and given them new understanding and problem-solving abilities.
I caught a glimpse of that progress during a private demonstration of Google’s latest robot model, called RT-2. The model, to be unveiled Friday, is a first step toward what Google executives describe as a quantum leap in the way robots are built and programmed.
“As a result of this change, we have had to rethink our entire research program,” said Vincent Vanhoucke, head of robotics at Google DeepMind. “Many of the things we worked on before have been completely nullified.”
Robots still lack human-level dexterity and fail at some basic tasks, but Google’s use of AI language models to give robots new skills of reasoning and improvisation is a promising breakthrough, said Ken Goldberg, a professor of robotics at the University of California, Berkeley.
“What’s really impressive is how it links semantics to robots,” he said. “That’s very exciting for robotics.”
To understand the magnitude of this, it helps to know a little about how robots are conventionally built.
For years, engineers at Google and other companies trained robots to perform a mechanical task, such as flipping a hamburger, by programming them with a specific list of instructions. (Lower the spatula 6.5 inches, slide it forward until it meets resistance, raise it 4.2 inches, rotate it 180 degrees, and so on.) Robots would then practice the task over and over, with engineers adjusting the instructions each time until they got them right.
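A hard-coded routine of the kind described above might look roughly like the following Python sketch. The `Step` type, the command names and the `describe` helper are all invented here for illustration and do not correspond to any real robot's control software; only the distances and angles come from the example above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    """One fixed instruction in a scripted routine (hypothetical format)."""
    command: str
    amount: Optional[float] = None  # None when the step runs until a sensor event
    unit: str = ""


# The burger-flipping routine, written out step by step by an engineer.
FLIP_HAMBURGER: List[Step] = [
    Step("lower_spatula", 6.5, "inches"),
    Step("slide_forward_until_resistance"),
    Step("raise_spatula", 4.2, "inches"),
    Step("rotate_spatula", 180.0, "degrees"),
]


def describe(program: List[Step]) -> List[str]:
    """Render each step as a human-readable line."""
    lines = []
    for step in program:
        if step.amount is None:
            lines.append(step.command)
        else:
            lines.append(f"{step.command} {step.amount:g} {step.unit}")
    return lines


if __name__ == "__main__":
    for line in describe(FLIP_HAMBURGER):
        print(line)
```

Every number in such a script is tuned by hand for one task, so a new task means writing and tuning a new list.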
This approach worked for certain limited applications. But training robots in this way is slow and labor-intensive. It requires collecting a lot of data from real-world tests. And if you wanted to teach a robot to do something new (flipping a pancake instead of a hamburger, for example), you usually had to program it from scratch.
Partly because of these limitations, hardware robots have improved less quickly than their software-based siblings. OpenAI, the creator of ChatGPT, disbanded its robotics team in 2021, citing slow progress and a lack of quality training data. In 2017, Alphabet, Google’s parent company, sold Boston Dynamics, a robotics company it acquired, to Japanese technology conglomerate SoftBank. (Now owned by Hyundai, Boston Dynamics appears to exist primarily to produce viral videos of humanoid robots performing terrifying feats of agility.)
In recent years, Google researchers had an idea. What if, instead of being programmed one by one for specific tasks, robots could use an AI language model – one trained on huge chunks of internet text – to learn new skills for themselves?
“We started playing with these language models about two years ago, and then we realized that they have a lot of knowledge in them,” said Karol Hausman, a Google researcher. “So we started connecting them with robots.”
Google’s first attempt at merging language models and physical robots was a research project called PaLM-SayCan, unveiled last year. It attracted some attention, but its usefulness was limited. The robots lacked the ability to interpret images – a crucial skill if you want them to be able to navigate the world. They could write down step-by-step instructions for different tasks, but they couldn’t translate those steps into actions.
Google’s new robot model, RT-2, can do just that. It’s what the company calls a “vision-language-action” model, or an AI system that can not only see and analyze the world around it, but also tell a robot how to move.
It does this by translating the robot’s movements into a series of numbers (a process called tokenizing) and incorporating those tokens into the same training data as the language model. Ultimately, just as ChatGPT or Bard learns to guess which words should go in a poem or a history essay, RT-2 can learn to guess how to move a robot’s arm to pick up a ball or throw an empty soda can into the trash bin.
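To make the idea of turning movements into numbers concrete, here is a minimal Python sketch of one common way to do it: clip each component of a continuous arm command to a fixed range and sort it into one of a fixed number of bins, so that a movement becomes a short string of integers a language model can predict like words. The bin count, the value range, the layout of the action vector and all function names are assumptions made for illustration, not details confirmed by Google.

```python
from typing import List, Sequence

NUM_BINS = 256         # assumed number of discrete values per action dimension
LOW, HIGH = -1.0, 1.0  # assumed normalized range for every dimension


def tokenize_action(action: Sequence[float]) -> List[int]:
    """Map each continuous value to an integer bin in [0, NUM_BINS - 1]."""
    tokens = []
    for value in action:
        clipped = min(max(value, LOW), HIGH)
        fraction = (clipped - LOW) / (HIGH - LOW)
        # The top edge of the range would land in bin NUM_BINS, so cap it.
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def detokenize_action(tokens: Sequence[int]) -> List[float]:
    """Map each bin index back to the center of its bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (t + 0.5) * width for t in tokens]


def to_text(tokens: Sequence[int]) -> str:
    """Render the tokens as a space-separated string, like words in a sentence."""
    return " ".join(str(t) for t in tokens)


if __name__ == "__main__":
    # Hypothetical command: small moves in x, y, z plus a gripper value.
    command = [0.10, -0.25, 0.0, 1.0]
    tokens = tokenize_action(command)
    print(to_text(tokens))
    print(detokenize_action(tokens))
```

The round trip loses a little precision, at most half a bin width per value, which is the price of giving the model a small, fixed vocabulary of movements.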
“In other words, this model can teach a robot to speak,” said Mr. Hausman.
During an hour-long demonstration, which took place in a Google office kitchen littered with dollar store items, my podcast co-host and I watched RT-2 perform some impressive tasks. It successfully followed complex instructions such as “move the Volkswagen to the German flag,” which it did by finding and grasping a model VW bus and placing it on top of a miniature German flag a few feet away.
It was also able to follow instructions in languages other than English and even make abstract connections between related concepts. Once, when I wanted RT-2 to pick up a soccer ball, I instructed it to “pick up Lionel Messi.” RT-2 got it right on the first try.
The robot was not perfect. It incorrectly identified the flavor of a can of LaCroix placed on the table in front of it. (The can was lemon; RT-2 guessed orange.) Another time, when asked what kind of fruit was on a table, the robot simply replied “white.” (It was a banana.) A spokeswoman for Google said the robot had used a cached answer to a previous tester’s question because the Wi-Fi was momentarily down.
Google has no immediate plans to sell or widely release RT-2 robots, but the researchers believe these new language-enabled machines will eventually be useful for more than parlor tricks. Robots with built-in language models could be placed in warehouses, used in medicine or even employed as household helpers — folding the laundry, unloading the dishwasher, tidying the house, they said.
“This really makes it possible to use robots in environments where there are people,” said Mr. Vanhoucke. “In office environments, in home environments, in all places where many physical tasks have to be performed.”
Of course, moving objects in the cluttered, chaotic physical world is more difficult than in a controlled laboratory. And given that AI language models often make mistakes or come up with nonsensical answers — what researchers call hallucination or confabulation — using them as the brains of robots could pose new risks.
But Mr. Goldberg, a Berkeley robotics professor, said those risks were still small.
“We’re not talking about letting these things loose,” he said. “In these lab settings, they’re just trying to push some objects on a table.”
For its part, Google said RT-2 was equipped with numerous safety features. In addition to a big red button on the back of each robot – which stops the robot when pressed – the system uses sensors to prevent it from crashing into people or objects.
The AI software built into RT-2 has its own safeguards that can be used to prevent the robot from doing anything harmful. Case in point: Google’s robots can be trained not to pick up containers with water in them, as water can damage their hardware if it spills.
If you’re the kind of person who worries about AI going rogue (and Hollywood has given us plenty of reasons to fear that scenario, from the original “Terminator” to last year’s “M3gan”), the idea of creating robots that can reason, plan and improvise on the fly probably seems like a terrible one to you.
But at Google, it’s the kind of idea that researchers celebrate. After years in the wilderness, hardware robots are back, and they owe it to their chatbot brains.