Google makes robots learn new skills faster by sharing their experience

The Google Brain Team, together with fellow Alphabet organizations DeepMind and X, has revealed a learning method that lets robots pool their experience and pick up new skills far faster.

Humanity has an entire list of activities that we cannot wait to delegate to robots. Some are dangerous, others are unpleasant, and some are simply better suited for a mechanical mind. But even a task as simple as assisting the elderly with chores and daily activities is incredibly complex for a robot. Look closely at the most mundane task you can think of and you will find that it actually relies on a lot of decision-making and previous experience.
In other words, it takes a lot of work for a robot to learn a skill that we wouldn’t normally describe as robotic. The reason is that a robot relies on its own experience to hone a skill, which takes an impractical amount of time no matter how sophisticated its learning algorithm is. This is especially true when motor skills are involved. We are naturally good at integrating our senses, reflexes, and muscles in a closely coordinated feedback loop; “naturally”, because our behaviors are well honed for the variability and complexity of the environment. That is not the case for robots.
Thankfully, Sergey Levine (Google Brain Team), Timothy Lillicrap (DeepMind) and Mrinal Kalakrishnan (X) have developed and demonstrated a method that allows robots to learn from each other’s experiences. By learning collectively, the robots accumulate experience much faster.
The robots instantaneously transmit their experience to one another over the network – an approach sometimes known as “cloud robotics” – and it is this ability that lets them learn from each other to perform motor skills in close coordination with sensing in realistic environments.
The researchers performed three experiments, each investigating a different approach to general-purpose skill learning across multiple robots: learning motion skills directly from experience, learning internal models of physics, and learning skills with human assistance. In all three cases, multiple robots shared their experiences to build a common model of the skill.
Learning from raw experience with model-free reinforcement learning
Trial-and-error learning is ubiquitous among humans and animals, and it extends well to robots. This kind of learning is called “model-free” because no explicit model of the environment is formed: the robots explore variations on their existing behavior and then reinforce and exploit the variations that yield bigger rewards. In combination with deep neural networks, model-free algorithms have been key to recent successes, most notably in the game of Go. Having multiple robots learn this way speeds up the process significantly.
In these experiments, robots were tasked with moving their arms to goal locations, or reaching to and opening a door. Each robot has a copy of a neural network that estimates the value of taking a given action in a given state. By querying this network, the robot can quickly decide which actions might be worth taking in the world. When a robot acts, noise is added to the actions it selects, so the resulting behavior is sometimes a bit better than previously observed, and sometimes a bit worse. This allows each robot to explore different ways of approaching a task.

Records of the actions taken by the robots, their behaviors, and the final outcomes are sent back to a central server. The server collects the experiences from all of the robots and uses them to iteratively improve the neural network that estimates value for different states and actions. The model-free algorithms the researchers employed look across both good and bad experiences and distill them into a new network that is better at understanding how action and success are related.

Then, at regular intervals, each robot takes a copy of the updated network from the server and begins to act using the information in its new network. Since this updated network is a bit better at estimating the true value of actions in the world, the robots produce better behavior. The cycle can then be repeated to continue improving on the task. In the video below, a robot explores the door opening task.

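To make this cycle concrete, here is a minimal, hypothetical Python sketch of the shared-experience loop. The toy reward, the linear value model, and the least-squares update are stand-ins chosen for brevity rather than the researchers' actual network or algorithm; only the structure mirrors the description above: each robot acts with exploration noise, records go to a central server, and a refreshed shared model is pulled back at regular intervals.

```python
# Hypothetical sketch of the shared-experience loop; a tiny linear model and a
# toy reward stand in for the real deep network, environment, and update rule.
import numpy as np

STATE_DIM, ACTION_DIM = 4, 2

class ValueNet:
    """Tiny linear stand-in for the network that scores (state, action) pairs."""
    def __init__(self):
        self.w = np.zeros(STATE_DIM + ACTION_DIM)

    def value(self, state, action):
        return float(np.dot(self.w, np.concatenate([state, action])))

    def best_action(self, state, candidates):
        # Query the network to pick the most promising action from a candidate set.
        return max(candidates, key=lambda a: self.value(state, a))

def robot_episode(net, rng, noise_scale=0.1):
    """One robot acts with exploration noise and records its experience."""
    state = rng.normal(size=STATE_DIM)
    candidates = [rng.normal(size=ACTION_DIM) for _ in range(8)]
    action = net.best_action(state, candidates) + rng.normal(scale=noise_scale, size=ACTION_DIM)
    reward = -np.linalg.norm(state[:ACTION_DIM] - action)  # toy "distance to goal" outcome
    return state, action, reward

def server_update(experiences):
    """Central server distills good and bad experiences into an improved network."""
    X = np.array([np.concatenate([s, a]) for s, a, _ in experiences])
    y = np.array([r for _, _, r in experiences])
    net = ValueNet()
    net.w, *_ = np.linalg.lstsq(X, y, rcond=None)  # regress value estimates onto outcomes
    return net

rng = np.random.default_rng(0)
shared_net = ValueNet()
for iteration in range(5):
    # Several robots act in parallel with copies of the same network, then pool records.
    experiences = [robot_episode(shared_net, rng) for _robot in range(4) for _trial in range(20)]
    shared_net = server_update(experiences)  # each robot then downloads the update and repeats
```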
With a few hours of practice, robots sharing their raw experience learn to make reaches to targets, and to open a door by making contact with the handle and pulling. In the case of door opening, the robots learn to deal with the complex physics of the contacts between the hook and the door handle without building an explicit model of the world, as can be seen in the example below:

Learning how the world works by interacting with objects
Direct trial-and-error reinforcement learning is a great way to learn individual skills. However, humans and animals don’t learn exclusively by trial and error. We also build mental models about our environment and imagine how the world might change in response to our actions.
This experiment starts with the simplest of physical interactions: the researchers had the robots play with a wide variety of common household objects by randomly prodding and pushing them inside a tabletop bin, letting them learn the basics of cause and effect from their own experiences. The robots again shared their experiences with each other and together built a single predictive model that attempted to forecast what the world might look like in response to their actions. This predictive model can make simple, if slightly blurry, forecasts about future camera images when provided with the current image and a possible sequence of actions that the robot might execute:

Top row: robotic arms interacting with common household items. Bottom row: predicted future camera images given an initial image and a sequence of actions.
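To illustrate the idea, here is a hypothetical Python sketch of how such an action-conditioned predictor could be fit on pooled interaction data and then rolled forward over a candidate action sequence. The tiny image size, random placeholder data, and linear model are assumptions for brevity; the actual system uses a deep network trained on real camera frames.

```python
# Hypothetical sketch of an action-conditioned frame predictor trained on pooled
# (image, action, next image) triples; random placeholder data shows only the shapes.
import numpy as np

H = W = 8   # toy "camera image"
A = 2       # pushing action: (dx, dy)

def features(img, action):
    return np.concatenate([img.ravel(), action])

rng = np.random.default_rng(0)

# Pooled experience from many robots: random prods and the frames they produced.
images = rng.random((500, H, W))
actions = rng.normal(size=(500, A))
next_images = images + 0.05 * rng.normal(size=(500, H, W))  # placeholder "effects"

# Fit the predictive model: next frame as a linear function of (frame, action).
X = np.array([features(i, a) for i, a in zip(images, actions)])
Y = next_images.reshape(len(images), -1)
model, *_ = np.linalg.lstsq(X, Y, rcond=None)

def predict_rollout(image, action_sequence):
    """Forecast the future frames produced by a candidate sequence of actions."""
    frames, current = [], image
    for a in action_sequence:
        current = (features(current, a) @ model).reshape(H, W)
        frames.append(current)
    return frames

forecast = predict_rollout(images[0], rng.normal(size=(3, A)))
print(len(forecast), forecast[0].shape)  # 3 predicted frames, each H x W
```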

Once this model is trained, the robots can use it to perform purposeful manipulations, for example based on user commands. In this prototype, a user can command the robot to move a particular object simply by clicking on that object, and then clicking on the point where the object should go:

The robots in this experiment were not told anything about objects or physics: they only see that the command requires a particular pixel to be moved to a particular place. However, because they have seen so many object interactions in their shared past experiences, they can forecast how particular actions will affect particular pixels. In order for such an implicit understanding of physics to emerge, the robots must be provided with a sufficient breadth of experience. This requires either a lot of time, or sharing the combined experiences of many robots.
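A hypothetical sketch of the planning step behind that prototype is shown below: sample candidate action sequences, ask the learned model where each sequence would leave the chosen pixel, and execute the sequence whose forecast lands closest to the user's target click. The toy dynamics function is a stand-in for the image-based forecaster; the sample-and-score structure is the part being illustrated.

```python
# Hypothetical sketch of planning against a learned predictive model: a toy
# position forecast stands in for the real pixel-motion predictions.
import numpy as np

rng = np.random.default_rng(1)

def predicted_pixel_after(pixel, action_sequence):
    """Stand-in for the model's forecast of where the chosen pixel ends up."""
    return pixel + action_sequence.sum(axis=0)  # toy dynamics: pushes simply add up

def plan_push(source_pixel, target_pixel, horizon=3, num_samples=256):
    """Pick the action sequence whose predicted outcome best satisfies the user's clicks."""
    candidates = rng.normal(scale=5.0, size=(num_samples, horizon, 2))
    scores = [np.linalg.norm(predicted_pixel_after(source_pixel, seq) - target_pixel)
              for seq in candidates]
    return candidates[int(np.argmin(scores))]

# The user clicks the object at pixel (12, 30) and asks for it to end up at (40, 35).
best_plan = plan_push(np.array([12.0, 30.0]), np.array([40.0, 35.0]))
print(best_plan.shape)  # (3, 2): three pushes, each a (dx, dy)
```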


Learning with the help of humans
Robots can learn entirely on their own, but human guidance is important, not just for telling the robot what to do, but also for helping it along. We have a lot of intuition about how various manipulation skills can be performed, and it seems only natural that transferring this intuition to robots can help them learn these skills a lot faster. In the next experiment, each robot was provided with a different door, and an instructor guided each robot by hand to show how its door can be opened. These demonstrations are encoded into a single combined strategy for all robots, called a policy. The policy is a deep neural network that converts camera images to robot actions, and it is maintained on a central server. The following video shows the instructor demonstrating the door-opening skill to a robot:

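A minimal, hypothetical sketch of how the pooled demonstrations could be distilled into one shared policy follows: supervised regression from camera images to the demonstrated actions. The placeholder data, image size, and linear map are assumptions standing in for the deep network and real demonstrations described above.

```python
# Hypothetical sketch of distilling demonstrations into a shared image-to-action
# policy; a linear map and random placeholder data stand in for the deep network.
import numpy as np

H = W = 16      # toy camera image
ACTION_DIM = 7  # e.g. joint velocities of the arm

rng = np.random.default_rng(2)

# Demonstrations collected from every robot and sent to the central server.
demo_images = rng.random((300, H, W))
demo_actions = rng.normal(size=(300, ACTION_DIM))

# "Train" the shared policy: least-squares fit from flattened images to actions.
X = demo_images.reshape(len(demo_images), -1)
policy_weights, *_ = np.linalg.lstsq(X, demo_actions, rcond=None)

def policy(image):
    """The shared policy: map the current camera image to a robot action."""
    return image.ravel() @ policy_weights

print(policy(demo_images[0]).shape)  # (7,) action predicted for one camera frame
```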
Next, the robots collectively improve this policy through a trial-and-error learning process. Each robot attempts to open its own door using the latest available policy, with some added noise for exploration. These attempts allow each robot to plan a better strategy for opening the door the next time around, and improve the policy accordingly:

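The collective improvement loop might be sketched as follows. This is a simplified stand-in (noisy perturbations of the shared policy, scored on each robot's own door, with the server nudging the policy toward the better attempts) rather than the researchers' actual update rule, but it shows how trials from robots facing different doors can improve a single policy.

```python
# Hypothetical sketch of collective trial-and-error improvement of a shared policy;
# the door "simulator" and the averaging update are stand-ins for the real system.
import numpy as np

rng = np.random.default_rng(3)
NUM_ROBOTS, PARAM_DIM = 4, 10

def attempt_door(robot_id, params):
    """Stand-in for one door-opening attempt; returns a success score."""
    door_specific_goal = np.full(PARAM_DIM, robot_id * 0.1)  # each robot has a different door
    return -np.linalg.norm(params - door_specific_goal)

policy_params = np.zeros(PARAM_DIM)  # shared policy held on the central server
for iteration in range(20):
    trials = []
    for robot_id in range(NUM_ROBOTS):
        noisy = policy_params + rng.normal(scale=0.2, size=PARAM_DIM)  # exploration noise
        trials.append((attempt_door(robot_id, noisy), noisy))
    # Server update: move the shared policy toward the best attempts from all robots.
    trials.sort(key=lambda t: t[0], reverse=True)
    elite = np.mean([params for _, params in trials[:NUM_ROBOTS // 2]], axis=0)
    policy_params = 0.5 * policy_params + 0.5 * elite
```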
Not surprisingly, robots learn more effectively if they are trained on a curriculum of tasks of gradually increasing difficulty. In the experiment, each robot starts off by practicing the door-opening skill on the specific position and orientation of the door that the instructor had previously shown it. As it gets better at performing the task, the instructor starts to alter the position and orientation of the door to be just a bit beyond the current capabilities of the policy, but not so difficult that it fails entirely. This allows the robots to gradually increase their skill level over time and expands the range of situations they can handle. The combination of human guidance and trial-and-error learning allowed the robots to collectively learn the skill of door-opening in just a couple of hours. Since the robots were trained on doors that look different from each other, the final policy succeeds on a door with a handle that none of the robots had seen before:

These are relatively simple tasks involving relatively simple skills, but the method is what matters, and it may prove a key to enabling robots to assist us in our daily lives. Maybe start thinking about which chores you would like to get rid of first. Maybe consider how easily a robot could replace you at your job. Learn a new skill or two.
All media and experiment descriptions courtesy of Sergey Levine, Timothy Lillicrap, and Mrinal Kalakrishnan, Google Research Blog.