
GTA: Generative Trajectory Augmentation with Guidance for Offline Reinforcement Learning

Jaewoo Lee, Sujin Yun, Taeyoung Yun, Jinkyoo Park

TL;DR

GTA is a novel generative data augmentation approach that enriches offline data by augmenting trajectories to be both high-rewarding and dynamically plausible; it enhances the performance of widely used offline RL algorithms across various tasks with unique challenges.

Abstract

Offline Reinforcement Learning (Offline RL) poses the challenge of learning effective decision-making policies from static datasets without any online interaction. Data augmentation techniques, such as noise injection and data synthesis, aim to improve Q-function approximation by smoothing the learned state-action region. However, these methods often fall short of directly improving the quality of offline datasets, leading to suboptimal results. In response, we introduce GTA (Generative Trajectory Augmentation), a novel generative data augmentation approach designed to enrich offline data by augmenting trajectories to be both high-rewarding and dynamically plausible. GTA applies a diffusion model within the data augmentation framework: it partially noises original trajectories and then denoises them with classifier-free guidance, conditioning on an amplified return value. Our results show that GTA, as a general data augmentation strategy, enhances the performance of widely used offline RL algorithms across various tasks with unique challenges. Furthermore, we conduct a quality analysis of the data augmented by GTA and demonstrate that GTA improves data quality. Our code is available at https://github.com/Jaewoopudding/GTA
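The partial-noising and guided-denoising step described in the abstract can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the linear DDPM noise schedule, the noise level `mu`, the guidance weight `w`, the hypothetical `amplify_return` rule, and the hypothetical noise-prediction network `eps_model` (with `toy_eps` as a stand-in) are all illustrative choices. A trajectory segment is assumed to be stored as a `(horizon, features)` array, and guidance uses the standard classifier-free combination $\epsilon_u + w(\epsilon_c - \epsilon_u)$.

```python
import numpy as np

def linear_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Standard DDPM linear beta schedule (assumption; GTA's schedule may differ)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def amplify_return(r, amp, r_max):
    """Hypothetical amplification rule: move return r toward r_max by fraction amp, then clip."""
    return min(r + amp * (r_max - r), r_max)

def gta_augment(x0, r, eps_model, mu=0.5, amp=0.3, r_max=1.0, w=2.0, T=100, rng=None):
    """Partially noise x0 to step k = round(mu * T), then denoise with classifier-free guidance.

    eps_model(x, t, y) is a hypothetical noise-prediction network; y=None selects
    the unconditional (null-condition) branch used by classifier-free guidance.
    """
    rng = np.random.default_rng() if rng is None else rng
    betas, alphas, alpha_bars = linear_schedule(T)
    k = int(round(mu * T))
    if k == 0:
        return x0.copy()  # mu = 0: no noise is added, so the trajectory is unchanged
    # Forward (partial) noising: sample q(x_k | x_0) in closed form.
    ab = alpha_bars[k - 1]
    x = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * rng.standard_normal(x0.shape)
    y = amplify_return(r, amp, r_max)  # condition on the amplified return
    # Reverse DDPM steps t = k, k-1, ..., 1 with guided noise estimates.
    for t in range(k, 0, -1):
        i = t - 1  # 0-based index into the schedule arrays
        eps_c = eps_model(x, t, y)     # conditional prediction
        eps_u = eps_model(x, t, None)  # unconditional prediction
        eps = eps_u + w * (eps_c - eps_u)
        mean = (x - betas[i] / np.sqrt(1.0 - alpha_bars[i]) * eps) / np.sqrt(alphas[i])
        if t > 1:
            x = mean + np.sqrt(betas[i]) * rng.standard_normal(x.shape)
        else:
            x = mean  # no noise on the final step
    return x

def toy_eps(x, t, y):
    """Stand-in noise predictor so the sketch runs (not a trained model)."""
    if y is None:
        return np.zeros_like(x)
    return 0.01 * y * np.ones_like(x)

x0 = np.random.default_rng(0).standard_normal((8, 5))  # toy segment: 8 steps x 5 features
x_aug = gta_augment(x0, r=0.4, eps_model=toy_eps, mu=0.3, rng=np.random.default_rng(1))
print(x_aug.shape)  # (8, 5)
```

With `w = 0` the sampler ignores the return condition, and larger `w` pushes samples harder toward the amplified return; `mu` sets how far the sample can drift from the original before denoising pulls it back toward the learned data distribution.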

Paper Structure

This paper contains 59 sections, 15 equations, 11 figures, and 29 tables.

Figures (11)

  • Figure 1: Comparison of noise injection (laskin2020reinforcement; sinha2022s4rl), generative data augmentation (lu2023synthetic), and GTA.
  • Figure 2: The overall GTA framework comprises three major stages. In the first stage, we train a conditional diffusion model designed to generate trajectories. Next, we perturb the original trajectory and denoise it using the trained diffusion model, conditioned on the amplified return. Lastly, we use the augmented dataset to train various offline RL algorithms.
  • Figure 3: Mechanism of the partial noising and denoising framework. The extent of exploration increases with $\mu$ ($\mu_1<\mu_2<\mu_3$). During denoising, amplified return guidance shifts trajectories towards the high-rewarding region.
  • Figure 4: (a), (b) D4RL normalized score across different noise levels over the course of training TD3BC on halfcheetah-medium-v2 and halfcheetah-medium-expert-v2. (c) Comparison of the oracle reward sum of subtrajectories across conditioning strategies on halfcheetah-medium-v2.
  • Figure 5: Data quality analysis of S4RL, SynthER, and GTA. GTA augmented data exhibits superior optimality and novelty across gym locomotion datasets while maintaining dynamic plausibility.
  • ...and 6 more figures
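Figure 3's statement that the extent of exploration increases with $\mu$ can be checked directly for the forward (noising) half of the mechanism. Assuming a linear DDPM schedule with $T = 100$ (GTA's actual schedule and step count may differ) and $k = \mathrm{round}(\mu T)$, the expected squared distance between a partially noised trajectory and its original is $\mathbb{E}\|x_k - x_0\|^2 = (1-\sqrt{\bar\alpha_k})^2\|x_0\|^2 + (1-\bar\alpha_k)\,d$, where $d$ is the number of entries in $x_0$. Both terms grow as $\bar\alpha_k$ shrinks, so the distance grows with $\mu$. The sketch computes this closed form and cross-checks it by sampling; it does not model the guided denoising step, and all function names are hypothetical.

```python
import numpy as np

def alpha_bar_at(k, T=100, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_k (1-based k) for an assumed linear DDPM beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return float(np.cumprod(1.0 - betas)[k - 1])

def expected_sq_deviation(x0, mu, T=100):
    """Closed-form E||x_k - x0||^2 for x_k ~ q(x_k | x0), with k = round(mu * T).

    x_k - x0 = (sqrt(ab) - 1) * x0 + sqrt(1 - ab) * eps; the cross term vanishes
    because E[eps] = 0, leaving (1 - sqrt(ab))^2 * ||x0||^2 + (1 - ab) * dim.
    """
    k = int(round(mu * T))
    if k == 0:
        return 0.0
    ab = alpha_bar_at(k, T)
    return (1.0 - np.sqrt(ab)) ** 2 * float(np.sum(x0 ** 2)) + (1.0 - ab) * x0.size

def monte_carlo_sq_deviation(x0, mu, n=20000, T=100, seed=0):
    """Sampling-based estimate of the same quantity, used as a sanity check."""
    k = int(round(mu * T))
    if k == 0:
        return 0.0
    ab = alpha_bar_at(k, T)
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n,) + x0.shape)       # n independent noise draws
    xk = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps  # broadcasts x0 over the n draws
    return float(np.mean(np.sum((xk - x0).reshape(n, -1) ** 2, axis=1)))

x0 = np.random.default_rng(0).standard_normal((4, 3))  # toy trajectory segment
for mu in (0.1, 0.3, 0.5):  # mu_1 < mu_2 < mu_3, as in Figure 3
    print(mu, expected_sq_deviation(x0, mu), monte_carlo_sq_deviation(x0, mu))
```

The closed form makes the trade-off explicit: a small $\mu$ keeps augmented samples close to the original data, while a large $\mu$ lets the diffusion model move them farther, which is where return guidance has more room to act.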