Waypoint-Based Reinforcement Learning for Robot Manipulation Tasks

Shaunak A. Mehta; Soheil Habibian; Dylan P. Losey

TL;DR

This paper proposes a waypoint-based approach to model-free reinforcement learning: instead of learning a low-level policy, the robot learns a trajectory of waypoints and interpolates between them using existing controllers. The authors show theoretically that an ideal solution to this reformulation has lower regret bounds than standard frameworks.

Abstract

Robot arms should be able to learn new tasks. One framework here is reinforcement learning, where the robot is given a reward function that encodes the task, and the robot autonomously learns actions to maximize its reward. Existing approaches to reinforcement learning often frame this problem as a Markov decision process, and learn a policy (or a hierarchy of policies) to complete the task. These policies reason over hundreds of fine-grained actions that the robot arm needs to take: e.g., moving slightly to the right or rotating the end-effector a few degrees. But the manipulation tasks that we want robots to perform can often be broken down into a small number of high-level motions: e.g., reaching an object or turning a handle. In this paper we therefore propose a waypoint-based approach for model-free reinforcement learning. Instead of learning a low-level policy, the robot now learns a trajectory of waypoints, and then interpolates between those waypoints using existing controllers. Our key novelty is framing this waypoint-based setting as a sequence of multi-armed bandits: each bandit problem corresponds to one waypoint along the robot's motion. We theoretically show that an ideal solution to this reformulation has lower regret bounds than standard frameworks. We also introduce an approximate posterior sampling solution that builds the robot's motion one waypoint at a time. Results across benchmark simulations and two real-world experiments suggest that this proposed approach learns new tasks more quickly than state-of-the-art baselines. See videos here: https://youtu.be/MMEd-lYfq4Y
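The idea of building a motion one waypoint at a time, with each waypoint chosen by solving a bandit problem, can be illustrated with a toy sketch. This is not the authors' algorithm: it swaps in textbook Gaussian Thompson sampling over a small discrete set of 1-D candidate positions, and every name here (`interpolate`, `posterior`, `learn_waypoints`, `toy_reward`, `GOALS`, the prior and noise variances) is a hypothetical choice made for illustration. Linear interpolation stands in for the existing low-level controller.

```python
import math
import random


def interpolate(waypoints, start, steps_per_segment=10):
    """Linearly interpolate through 1-D waypoints (stand-in for a real controller)."""
    path = [float(start)]
    for w in waypoints:
        prev = path[-1]
        for i in range(1, steps_per_segment + 1):
            a = i / steps_per_segment
            path.append((1.0 - a) * prev + a * w)
    return path


def posterior(total, count, prior_var, noise_var):
    """Gaussian posterior (mean, variance) of an arm's reward under a zero-mean prior."""
    var = 1.0 / (1.0 / prior_var + count / noise_var)
    return var * total / noise_var, var


def learn_waypoints(reward_fn, candidates, num_waypoints, episodes,
                    prior_var=1.0, noise_var=0.1, seed=0):
    """Solve one bandit per waypoint, freezing each waypoint before adding the next."""
    rng = random.Random(seed)
    frozen = []
    for _ in range(num_waypoints):
        totals = [0.0] * len(candidates)
        counts = [0] * len(candidates)
        for t in range(episodes):
            if t < len(candidates):
                arm = t  # try every candidate once before sampling
            else:
                # Posterior sampling: draw one reward estimate per arm, act greedily.
                samples = []
                for total, count in zip(totals, counts):
                    mean, var = posterior(total, count, prior_var, noise_var)
                    samples.append(rng.gauss(mean, math.sqrt(var)))
                arm = samples.index(max(samples))
            reward = reward_fn(frozen + [candidates[arm]])
            totals[arm] += reward
            counts[arm] += 1
        means = [posterior(s, c, prior_var, noise_var)[0]
                 for s, c in zip(totals, counts)]
        frozen.append(candidates[means.index(max(means))])
    return frozen


# Hypothetical toy task: the i-th waypoint should land on the i-th sub-goal.
GOALS = [0.25, 0.75]


def toy_reward(waypoints):
    return sum(math.exp(-4.0 * abs(w - g)) for w, g in zip(waypoints, GOALS))


learned = learn_waypoints(toy_reward, [0.0, 0.25, 0.5, 0.75, 1.0],
                          num_waypoints=2, episodes=50)
path = interpolate(learned, start=0.0)
```

In this toy setting the reward is deterministic and the candidate set is tiny, so each bandit is easy. The paper's setting involves continuous waypoints and a distribution of initial states, which is why it relies on an approximate posterior sampling scheme rather than the exact per-arm posterior used here.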

Paper Structure (11 sections, 4 equations, 3 figures, 1 table, 1 algorithm)

This paper contains 11 sections, 4 equations, 3 figures, 1 table, 1 algorithm.

Introduction
Related Work
Problem Formulation
Reinforcement Learning with Sequential Waypoints
Reformulation as a Sequence of Multi-Armed Bandits
Assumptions
Lower Bounds on Regret
Approximate Solution with Posterior Sampling
Benchmark Simulations
Real-World Experiments
Conclusion

Figures (3)

Figure 1: Our waypoint-based approach for model-free reinforcement learning in manipulation tasks. The robot arm learns where to place the next waypoint to maximize its reward by solving a multi-armed bandit. We then freeze the learned models for waypoint $i$, and repeat the process for waypoint $i+1$. This approach learns the desired task across a distribution of initial states; i.e., the location and angle of the drawers can change at each interaction.
Figure 2: Simulation environments and rewards for six manipulation tasks. These benchmark tasks are taken from robosuite [robosuite2020]. For each task the image on the left shows the environment setup, and the plot on the right shows the robot's rewards averaged over five runs. Higher rewards indicate better task performance. The dashed lines in the reward plots mark the episodes where our approach added a new waypoint to the trajectory. Adding a new waypoint often causes a sharp increase in the robot's reward, because the new waypoint enables the robot to complete the next part of the task: e.g., in Lift the robot needs one waypoint to grasp the block and a second waypoint to lift it.
Figure 3: Setup and results from our real-world experiments (see the Real-World Experiments section). In Lift the robot learned to pick up an item, and in Drawer it learned to open a drawer (also see Figure 1). The robot measured the initial position of the item or drawer as a part of the start state $s^0$. SAC learned a policy that often moved to the object of interest, but did not correctly interact with that object. When using Ours, the robot built a trajectory of two waypoints: the first waypoint grasped the object, and then the second waypoint interacted with that object (e.g., picked the item up, pulled the drawer open). The dashed line in the reward plots corresponds to the episode where our robot added a second waypoint to its trajectory.
