Bayesian Optimization for Sample-Efficient Policy Improvement in Robotic Manipulation

Adrian Röfer; Iman Nematollahi; Tim Welschehold; Wolfram Burgard; Abhinav Valada

TL;DR

BOpt-GMM is a hybrid approach that combines imitation learning with the robot's own experience collection: a skill model is learned from a few demonstrations and then refined with Bayesian optimization. Its sample efficiency is demonstrated on multiple complex manipulation skills in both simulation and real-world experiments.

Abstract

Sample-efficient learning of manipulation skills poses a major challenge in robotics. While recent approaches demonstrate impressive advances in the type of task that can be addressed and the sensing modalities that can be incorporated, they still require large amounts of training data. Especially with regard to learning actions on robots in the real world, this poses a major problem due to the high costs associated with both demonstrations and real-world robot interactions. To address this challenge, we introduce BOpt-GMM, a hybrid approach that combines imitation learning with own experience collection. We first learn a skill model as a dynamical system encoded in a Gaussian Mixture Model from a few demonstrations. We then improve this model with Bayesian optimization building on a small number of autonomous skill executions in a sparse reward setting. We demonstrate the sample efficiency of our approach on multiple complex manipulation skills in both simulations and real-world experiments. Furthermore, we make the code and pre-trained models publicly available at http://bopt-gmm.cs.uni-freiburg.de.
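
To make the phrase "a dynamical system encoded in a Gaussian Mixture Model" more concrete, the following is a minimal one-dimensional sketch of Gaussian mixture regression, a standard way to read a velocity field out of a joint GMM over position and velocity. It is not taken from the paper's code: the component layout (`prior`, `mean_x`, `mean_v`, `var_x`, `cov_vx`), the function names, and the toy parameters are all assumptions chosen for illustration, and the actual skill models are presumably higher-dimensional.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmr_velocity(x, components):
    """Expected velocity E[v | x] under a joint GMM over (x, v).

    Each component is a dict with hypothetical keys:
    prior, mean_x, mean_v, var_x (variance of x), cov_vx (covariance of v and x).
    """
    weights = [c["prior"] * gaussian_pdf(x, c["mean_x"], c["var_x"]) for c in components]
    total = sum(weights)
    if total == 0.0:
        # Far away from every component the densities underflow; stand still.
        return 0.0
    velocity = 0.0
    for w, c in zip(weights, components):
        # Conditional mean of v given x for a single Gaussian component.
        conditional_mean = c["mean_v"] + c["cov_vx"] / c["var_x"] * (x - c["mean_x"])
        velocity += (w / total) * conditional_mean
    return velocity


def rollout(x0, components, dt=0.1, steps=50):
    """Integrate the velocity field with explicit Euler steps."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        x += dt * gmr_velocity(x, components)
        trajectory.append(x)
    return trajectory


# Toy skill (assumed values): both components encode an attractor at x = 0.
toy_skill = [
    {"prior": 0.5, "mean_x": 1.0, "mean_v": -1.0, "var_x": 0.25, "cov_vx": -0.25},
    {"prior": 0.5, "mean_x": 0.0, "mean_v": 0.0, "var_x": 0.25, "cov_vx": -0.25},
]
path = rollout(1.5, toy_skill)
```

In this reading, the priors, means, and covariances of the mixture are the policy parameters, so a parameter update changes the velocity field that the robot follows.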

Paper Structure (12 sections, 10 equations, 5 figures, 3 tables)

Introduction
Related Work
Problem Formulation
BOpt-GMM Framework
GMM
GMM Parameterization
Bayesian Optimization
Experimental Evaluation
Experiment Setup
Experiments in Simulation
Real World Experiments
Conclusion

Figures (5)

Figure 1: We propose a simple but effective interpretation of a reinforcement learning problem as black-box optimization of a policy. The policy, encoded as a GMM, is evaluated to measure its accuracy. From this new measurement, the optimizer can regress a new improved update.
Figure 2: Our approach consists of two parts. The first is a Bayesian optimizer estimating the value $p\left(\Delta\theta \;\middle\vert\; D\right)$ and proposing potential new updates $\Delta\theta_i$. The second is the evaluation function $h(\Delta\theta_i, j)$, which plays the update $\Delta\theta_i$ for $j$ steps and averages the returns. The results are used to inform the optimizer.
Figure 3: (a) Simulated sliding of a horizontal hatch. The location and orientation of the hatch's frame are varied between episodes. (b) Simulated Drawer Opening. The location of the cabinet is varied in the XY plane. (c) Simulated opening of a door. The location of the door is varied per episode. The handle must be pressed to move the door. Real opening of a (d) sliding door, (e) drawer, (f) door.
Figure 4: Comparison of the mean performances of GMM, SAC-GMM, BOpt-GMM, and the Online-GMM baseline in our three simulated scenarios (Fig. 3). $k$ indicates the number of GMM components. We run each method for 500 episodes. We find BOpt-GMM and SAC-GMM to improve significantly over the initial GMM, while Online-GMM does not do so reliably, or even deteriorates performance. Additionally, we introduce $BC$ trained on the same demonstrations as the GMM, and $BC_{100}$ trained on a full $100$ demonstrations. The latter achieves recognizable but not comparable performance.
Figure 5: To illustrate the significance of the difference in sampling efficiency, we overlay the evolution of model performances of the three basic updates. The dashed red line shows the performance of the base GMM. Note: BOpt-GMM is only evaluated when a new incumbent is generated, while SAC-GMM is evaluated at regular intervals. Hence the different graph lengths.
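
The loop described in the captions of Figures 1 and 2 (propose an update, evaluate it by averaging returns, record the result, keep the best update as the incumbent) can be sketched as below. This is a hedged illustration, not the authors' implementation. Two assumptions should be stated plainly: $j$ is read here as a number of evaluation episodes, and the Bayesian optimizer is replaced by uniform random proposals so that the example stays dependency-free. The toy sparse-reward episode, the hidden optimum `best_update`, and all function names are hypothetical.

```python
import random


def evaluate_update(delta_theta, run_episode, episodes):
    """h(delta_theta, j): play the update for `episodes` episodes, average the returns."""
    returns = [run_episode(delta_theta) for _ in range(episodes)]
    return sum(returns) / len(returns)


def make_toy_episode(rng, best_update=(0.3, -0.2)):
    """Hypothetical sparse-reward task: return 1.0 on success, 0.0 otherwise.

    Success becomes more likely as delta_theta approaches a hidden optimum.
    """
    def run_episode(delta_theta):
        distance = sum((d - b) ** 2 for d, b in zip(delta_theta, best_update)) ** 0.5
        success_probability = max(0.0, 1.0 - distance)
        return 1.0 if rng.random() < success_probability else 0.0
    return run_episode


def improve_policy(run_episode, rng, dims=2, iterations=30, episodes=10, step=0.5):
    """Collect D = [(delta_theta, mean return), ...] and return the incumbent.

    Proposals are drawn uniformly at random; a Bayesian optimizer would
    instead choose each candidate from a model fitted to D.
    """
    zero_update = tuple(0.0 for _ in range(dims))  # the unmodified base policy
    data = [(zero_update, evaluate_update(zero_update, run_episode, episodes))]
    for _ in range(iterations):
        candidate = tuple(rng.uniform(-step, step) for _ in range(dims))
        data.append((candidate, evaluate_update(candidate, run_episode, episodes)))
    incumbent = max(data, key=lambda pair: pair[1])
    return incumbent, data


rng = random.Random(0)
(incumbent_update, incumbent_score), data = improve_policy(make_toy_episode(rng), rng)
base_score = data[0][1]  # mean return of the base policy (zero update)
```

Because the zero update is evaluated first and stays in `data`, the incumbent's recorded score can never fall below the base policy's recorded score. With noisy sparse rewards, that recorded score is still only an estimate from a handful of episodes.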
