Simple Ingredients for Offline Reinforcement Learning

Edoardo Cetin, Andrea Tirinzoni, Matteo Pirotta, Alessandro Lazaric, Yann Ollivier, Ahmed Touati

TL;DR

Simple methods like AWAC and IQL, when given larger networks, overcome the paradoxical failure modes caused by adding data in MOOD, and notably outperform prior state-of-the-art algorithms on the canonical D4RL benchmark.

Abstract

Offline reinforcement learning algorithms have proven effective on datasets highly connected to the target downstream task. Yet, leveraging a novel testbed (MOOD) in which trajectories come from heterogeneous sources, we show that existing methods struggle with diverse data: their performance considerably deteriorates as data collected for related but different tasks is simply added to the offline buffer. In light of this finding, we conduct a large empirical study where we formulate and test several hypotheses to explain this failure. Surprisingly, we find that scale, more than algorithmic considerations, is the key factor influencing performance. We show that simple methods like AWAC and IQL with increased network size overcome the paradoxical failure modes from the inclusion of additional data in MOOD, and notably outperform prior state-of-the-art algorithms on the canonical D4RL benchmark.

Paper Structure (27 sections, 10 equations, 11 figures, 21 tables, 4 algorithms)

Figures (11)

  • Figure 1: The AWAC algorithm learns to stand when trained on data generated by an agent learning to either stand, walk, or run, but completely fails on the union of these three datasets.
  • Figure 2: Average performance on the same- and mixed-objective datasets from MOOD (left), and the locomotion and antmaze datasets from D4RL (right). The large networks are simple MLPs for MOOD and modern architectures [deeper-deep-RL] for D4RL, and all involve an ensemble of 5 critics (Sec. \ref{sec:4conjectures}). "AW" in the last plot denotes the sampling strategy of [HongACL23harnessing] for unbalanced data.
  • Figure 3: Optimal advantage-weighted distribution $\pi_\mathcal{B}^\star$ (shaded areas) and its Gaussian projection (solid curves) after 1K and 1M optimization steps of AWAC on cheetah run with the same-objective (left) and the mixed-objective (right) datasets. Dashed lines indicate the actions chosen during evaluation using either the mean of the learned policy (black) or ES (blue) after 1M steps. Distributions are plotted for a randomly-chosen state and action dimension.
  • Figure 4: Empirical bias and variance of the AWAC and ASAC objective estimators for different batch sizes $n$ on cheetah run. Each point is averaged over 1000 randomly sampled minibatches of size $n$ at equally spaced checkpoints saved during training.
  • Figure 5: Continuous control environments in MOOD.
  • ...and 6 more figures
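
The Figure 3 caption refers to an advantage-weighted distribution and its Gaussian projection. The sketch below is a toy illustration of those two ideas only, assuming the standard AWAC weighting $\exp(A(s,a)/\lambda)$ over a batch of actions and a weighted maximum-likelihood Gaussian fit. The function names, the temperature value, and the four-sample batch are hypothetical and are not taken from the paper.

```python
import numpy as np

def advantage_weights(advantages, temperature=1.0):
    """Self-normalized exp(A / temperature) weights over a batch (assumed AWAC-style)."""
    z = np.asarray(advantages, dtype=float) / temperature
    z = z - z.max()  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def gaussian_projection(actions, weights):
    """Weighted maximum-likelihood Gaussian fit (mean, std) for 1-D actions."""
    actions = np.asarray(actions, dtype=float)
    mean = float(np.sum(weights * actions))
    var = float(np.sum(weights * (actions - mean) ** 2))
    return mean, float(np.sqrt(var))

# Hypothetical batch: two clusters of actions with identical advantages.
actions = np.array([-1.0, -1.0, 1.0, 1.0])
advantages = np.array([2.0, 2.0, 2.0, 2.0])

w = advantage_weights(advantages, temperature=1.0)
mean, std = gaussian_projection(actions, w)
```

In this toy batch the weights are uniform, so the fitted Gaussian is centered between the two clusters, at an action value that never appears in the data. This is a general property of fitting a unimodal Gaussian to a bimodal weighted distribution; the caption alone does not say whether the same effect drives the behavior plotted in Figure 3.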