A Workflow for Offline Model-Free Robotic Reinforcement Learning

Aviral Kumar, Anikait Singh, Stephen Tian, Chelsea Finn, Sergey Levine

TL;DR

The paper develops a practical workflow for offline RL, analogous to the relatively well-understood workflows for supervised learning problems. The workflow rests on a set of metrics and conditions that can be tracked over the course of offline training and that tell the practitioner how to adjust the algorithm and model architecture to improve final performance.

Abstract

Offline reinforcement learning (RL) enables learning control policies by utilizing only prior experience, without any online interaction. This can allow robots to acquire generalizable skills from large and diverse datasets, without any costly or unsafe online data collection. Despite recent algorithmic advances in offline RL, applying these methods to real-world problems has proven challenging. Although offline RL methods can learn from prior data, there is no clear and well-understood process for making various design choices, from model architecture to algorithm hyperparameters, without actually evaluating the learned policies online. In this paper, our aim is to develop a practical workflow for using offline RL analogous to the relatively well-understood workflows for supervised learning problems. To this end, we devise a set of metrics and conditions that can be tracked over the course of offline training, and can inform the practitioner about how the algorithm and model architecture should be adjusted to improve final performance. Our workflow is derived from a conceptual understanding of the behavior of conservative offline RL algorithms and cross-validation in supervised learning. We demonstrate the efficacy of this workflow in producing effective policies without any online tuning, both in several simulated robotic learning scenarios and for three tasks on two distinct real robots, focusing on learning manipulation skills from raw image observations with sparse binary rewards. Explanatory video and additional results can be found at sites.google.com/view/offline-rl-workflow.

Paper Structure

This paper contains 27 sections, 1 theorem, 20 equations, 22 figures, 3 tables.

Key Result

Theorem A.1

When running CQL using updates in Equation eqn:modified_cql in a tabular setting, using the policy $\pi^k(\mathbf{a}|\mathbf{s}) = \delta[\mathbf{a} = \arg\max_{\mathbf{a}'} Q^k(\mathbf{s}, \mathbf{a}')]$, the expected Q-value on the dataset, i.e., $f(k) := \mathbb{E}_{\mathbf{s}, \mathbf{a} \ldots}$ ...

Figures (22)

  • Figure 1: Our proposed workflow aims to detect overfitting and underfitting, and provides guidelines for addressing these issues via policy selection, regularization, and architecture design. We evaluate this workflow on two real-world robotic systems and simulation domains, and we find it to be effective.
  • Figure 2: Simulated domains (singh2020cog). Examples of trajectories for the (Top) pick and place task and (Bottom) grasping from drawer task.
  • Figure 3: Policy performance (Top) and average dataset Q-values of CQL (Bottom) with a varying number of trajectories. Vertical bands indicate regions around the peak in average Q-value; these regions correspond to policies with good actual performance.
  • Figure 4: Left: the CQL regularizer attains low values, especially with 50 and 100 trajectories in the pick and place task. Right: using VIB mitigates overfitting, giving rise to a stable trend in Q-values and better performance that does not degrade with more training steps.
  • Figure 5: Performance (left), TD error (middle) and average dataset Q-values (right) for the pick and place task with a variable number of objects. Note that while the learned Q-values increase and stabilize, the TD error values in scenarios with more than 10 objects are large (1.0-2.0). Correspondingly, the performance generally decreases as the number of objects increases.
  • ...and 17 more figures
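
The captions for Figures 3 and 5 suggest two signals that can be read from training logs alone: the peak in average dataset Q-value, and the size of the TD error once Q-values have stabilized. The sketch below is one possible way to turn those signals into a check. It is not the authors' implementation: the function names, the `drop_fraction` parameter, and the `td_threshold` default are assumptions made for illustration (the default of 1.0 only echoes the lower end of the range that the Figure 5 caption calls large).

```python
from typing import Sequence


def peak_checkpoint(avg_q: Sequence[float]) -> int:
    """Index of the logged checkpoint with the highest average dataset Q-value.

    Figure 3 reports that policies near this peak tend to perform well.
    """
    if len(avg_q) == 0:
        raise ValueError("need at least one logged Q-value")
    return max(range(len(avg_q)), key=lambda i: avg_q[i])


def diagnose_run(
    avg_q: Sequence[float],
    td_error: Sequence[float],
    drop_fraction: float = 0.1,  # hypothetical tolerance, not from the paper
    td_threshold: float = 1.0,   # hypothetical cutoff, see lead-in
) -> str:
    """Label a run as 'overfitting', 'underfitting', or 'ok' from its logs."""
    if len(avg_q) != len(td_error):
        raise ValueError("avg_q and td_error must have the same length")

    peak = peak_checkpoint(avg_q)
    q_range = avg_q[peak] - min(avg_q)   # total spread of logged Q-values
    drop = avg_q[peak] - avg_q[-1]       # how far Q-values fell after the peak

    # Q-values that rise and then fall noticeably: treat as overfitting.
    if q_range > 0 and drop > drop_fraction * q_range:
        return "overfitting"
    # Q-values that stay near their peak while TD error remains large:
    # treat as underfitting.
    if td_error[-1] >= td_threshold:
        return "underfitting"
    return "ok"


# Made-up logs, one value per evaluated checkpoint.
rise_then_fall = [0.1, 0.4, 0.7, 0.5, 0.2]
stable = [0.1, 0.3, 0.5, 0.5, 0.5]

best = peak_checkpoint(rise_then_fall)                               # 2
label_a = diagnose_run(rise_then_fall, [0.9, 0.5, 0.3, 0.3, 0.3])    # "overfitting"
label_b = diagnose_run(stable, [1.8, 1.6, 1.5, 1.5, 1.5])            # "underfitting"
```

When a run is labeled as overfitting, the checkpoint returned by `peak_checkpoint` is the natural candidate for deployment under this heuristic. Both thresholds would need to be chosen per task, since Q-value scales depend on the reward and discount.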

Theorems & Definitions (2)

  • Theorem A.1: Characterizing a decreasing trend in Q-values
  • proof