Guided Data Augmentation for Offline Reinforcement Learning and Imitation Learning

Nicholas E. Corrado, Yuxiao Qu, John U. Balis, Adam Labiosa, Josiah P. Hanna

TL;DR

Offline RL and imitation learning struggle when data is scarce and learned policies extrapolate poorly beyond the dataset. Guided Data Augmentation (GuDA) uses simple, domain-informed sampling rules over four invariance-based data augmentation functions (DAFs) to produce expert-quality augmented data from limited demonstrations. The approach yields consistent, substantial improvements across simulated navigation, driving, and soccer tasks, and also outperforms baselines on a challenging physical robot soccer task. By building human judgment about task progress into data augmentation, GuDA improves data efficiency and policy quality without requiring additional environment interaction.

Abstract

In offline reinforcement learning (RL), an RL agent learns to solve a task using only a fixed dataset of previously collected data. While offline RL has been successful in learning real-world robot control policies, it typically requires large amounts of expert-quality data to learn effective policies that generalize to out-of-distribution states. Unfortunately, such data is often difficult and expensive to acquire in real-world tasks. Several recent works have leveraged data augmentation (DA) to inexpensively generate additional data, but most DA works apply augmentations in a random fashion and ultimately produce highly suboptimal augmented experience. In this work, we propose Guided Data Augmentation (GuDA), a human-guided DA framework that generates expert-quality augmented data. The key insight behind GuDA is that while it may be difficult to demonstrate the sequence of actions required to produce expert data, a user can often easily characterize when an augmented trajectory segment represents progress toward task completion. Thus, a user can restrict the space of possible augmentations to automatically reject suboptimal augmented data. To extract a policy from GuDA, we use off-the-shelf offline reinforcement learning and behavior cloning algorithms. We evaluate GuDA on a physical robot soccer task as well as simulated D4RL navigation tasks, a simulated autonomous driving task, and a simulated soccer task. Empirically, GuDA enables learning given a small initial dataset of potentially suboptimal experience and outperforms a random DA strategy as well as a model-based DA strategy.
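
The idea of restricting augmentations to those that represent progress can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`translate`, `makes_progress`, `guided_augment`), the 2D position-only trajectory format, and the use of rejection sampling are all assumptions made for illustration. The sketch also assumes the task is translation-invariant, so shifted positions remain valid experience.

```python
import math
import random

def translate(segment, dx, dy):
    """Translation DAF (hypothetical): shift every (x, y) position in the segment."""
    return [(x + dx, y + dy) for x, y in segment]

def makes_progress(segment, goal):
    """User-supplied rule (hypothetical): the segment ends closer to the goal than it starts."""
    return math.dist(segment[-1], goal) < math.dist(segment[0], goal)

def random_augment(segment, bound, rng):
    """Random DA baseline: translate by an offset drawn uniformly from [-bound, bound]^2."""
    return translate(segment, rng.uniform(-bound, bound), rng.uniform(-bound, bound))

def guided_augment(segment, goal, bound, rng, max_tries=1000):
    """Guided DA sketch: keep only random translations that satisfy the progress rule.

    Returns None if no accepted augmentation is found within max_tries.
    """
    for _ in range(max_tries):
        candidate = random_augment(segment, bound, rng)
        if makes_progress(candidate, goal):
            return candidate
    return None

rng = random.Random(0)
segment = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # agent moving right
goal = (5.0, 0.0)
augmented = guided_augment(segment, goal, bound=5.0, rng=rng)
print(augmented)
```

Rejection sampling is used here only because it is the shortest way to show the filtering step. A sampling procedure can also construct a good augmentation directly, for example by translating a segment so that its final position lands on the goal.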

Paper Structure

This paper contains 27 sections, 4 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: An overview of GuDA applied to a parking task given DAFs that translate and rotate a trajectory segment $\tau$. A user first defines a sampling procedure describing how to translate and rotate $\tau$ to produce expert-quality data: translate $\tau$ so that the agent's final position is at the parking spot, and then rotate $\tau$ such that the agent is aligned with the parking spot. We augment our dataset using this sampling procedure and then learn a policy with offline RL or imitation learning.
  • Figure 2: GuDA translates trajectory segments $\tau_\text{up}, \tau_\text{right}$ to demonstrate the agent walking to the goal. A random translation (bottom right) may be highly suboptimal.
  • Figure 3: Interquartile mean (IQM) normalized returns over 10 independent runs with 95% stratified bootstrap confidence intervals for different DA strategies and algorithms. Normalized returns are computed as $100 \cdot \frac{R - R_\text{random}}{R_\text{expert} - R_\text{random}}$, where $R_\text{expert}$ and $R_\text{random}$ denote the average return of the demonstrator and of a policy that chooses actions uniformly at random, respectively, each computed over 100 trajectories.
  • Figure 4: Example augmentations under GuDA. The original trajectory segment is shown in yellow.
  • Figure 5: (a, b) Task initializations. (c) Initial data with relevant segments $\tau_1$ and $\tau_2$. (d) An illustration of GuDA data generated by translating, rotating, and/or reflecting $\tau_1$ and $\tau_2$.
  • ...and 5 more figures
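
The return normalization given in the Figure 3 caption can be written as a small helper. This is a sketch for illustration only: the function name `normalized_return` and its argument names are assumptions, and the example values are made up rather than taken from the paper.

```python
def normalized_return(ret, random_ret, expert_ret):
    """Scale a return so a random policy scores 0 and the demonstrator scores 100."""
    if expert_ret == random_ret:
        raise ValueError("expert and random returns must differ")
    return 100.0 * (ret - random_ret) / (expert_ret - random_ret)

# Illustrative values only (not results from the paper).
print(normalized_return(30.0, random_ret=10.0, expert_ret=50.0))
```

Under this scaling, a score above 100 means the learned policy achieves a higher return than the demonstrator, and a negative score means it does worse than acting uniformly at random.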