
ATraDiff: Accelerating Online Reinforcement Learning with Imaginary Trajectories

Qianlan Yang, Yu-Xiong Wang

TL;DR

ATraDiff tackles the long-standing data-efficiency problem in online reinforcement learning with sparse rewards by learning a diffusion-based generator from offline data to synthesize full trajectories. It introduces a trajectory-centric diffusion model capable of state- and image-level generation, plus a coarse-to-precise length strategy and an online adaptation loop to counter distribution shifts. By augmenting the replay buffer with synthetic trajectories and adapting the generator during online learning, ATraDiff achieves state-of-the-art performance across online, offline-to-online, and offline RL benchmarks, often outperforming transition-focused data augmentation methods. The approach is designed as a general, plug-in enhancement for any replay-buffer-based RL algorithm, offering substantial improvements especially in complex tasks and environments with distribution shifts.

Abstract

Training autonomous agents with sparse rewards is a long-standing problem in online reinforcement learning (RL), due to low data efficiency. Prior work overcomes this challenge by extracting useful knowledge from offline data, often accomplished through the learning of action distribution from offline data and utilizing the learned distribution to facilitate online RL. However, since the offline data are given and fixed, the extracted knowledge is inherently limited, making it difficult to generalize to new tasks. We propose a novel approach that leverages offline data to learn a generative diffusion model, coined as Adaptive Trajectory Diffuser (ATraDiff). This model generates synthetic trajectories, serving as a form of data augmentation and consequently enhancing the performance of online RL methods. The key strength of our diffuser lies in its adaptability, allowing it to effectively handle varying trajectory lengths and mitigate distribution shifts between online and offline data. Because of its simplicity, ATraDiff seamlessly integrates with a wide spectrum of RL methods. Empirical evaluation shows that ATraDiff consistently achieves state-of-the-art performance across a variety of environments, with particularly pronounced improvements in complicated settings. Our code and demo video are available at https://atradiff.github.io .
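The replay-buffer augmentation described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `AugmentedReplayBuffer`, the `synthetic_ratio` knob, the transition layout, and the `toy_generator` stand-in for a learned diffusion model are all hypothetical and chosen only to show how synthetic trajectories might be mixed with real transitions when sampling a training batch.

```python
import random
from collections import deque
from typing import Deque, List, Tuple

# (state, action, reward, next_state, done): a generic layout assumed for
# this sketch, not taken from the paper.
Transition = Tuple[float, float, float, float, bool]
Trajectory = List[Transition]


class AugmentedReplayBuffer:
    """Keeps real and synthetic transitions in separate pools (hypothetical)."""

    def __init__(self, capacity: int, synthetic_ratio: float = 0.5) -> None:
        if not 0.0 <= synthetic_ratio <= 1.0:
            raise ValueError("synthetic_ratio must be in [0, 1]")
        self.real: Deque[Transition] = deque(maxlen=capacity)
        self.synthetic: Deque[Transition] = deque(maxlen=capacity)
        # Assumed knob: fraction of each batch drawn from synthetic data.
        self.synthetic_ratio = synthetic_ratio

    def add_real(self, transition: Transition) -> None:
        self.real.append(transition)

    def add_synthetic_trajectory(self, trajectory: Trajectory) -> None:
        # A whole generated trajectory is stored as its individual transitions.
        self.synthetic.extend(trajectory)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        if not self.real and not self.synthetic:
            raise ValueError("buffer is empty")
        n_syn = round(batch_size * self.synthetic_ratio)
        if not self.synthetic:
            n_syn = 0            # nothing generated yet: real data only
        elif not self.real:
            n_syn = batch_size   # no real data yet: synthetic data only
        batch: List[Transition] = []
        if n_syn:
            batch += rng.choices(list(self.synthetic), k=n_syn)
        if batch_size - n_syn:
            batch += rng.choices(list(self.real), k=batch_size - n_syn)
        rng.shuffle(batch)
        return batch


def toy_generator(start_state: float, length: int) -> Trajectory:
    """Hypothetical stand-in for a learned trajectory generator."""
    trajectory: Trajectory = []
    state = start_state
    for t in range(length):
        action = 1.0
        next_state = state + action
        done = t == length - 1
        # Sparse reward: only the final step is rewarded.
        trajectory.append((state, action, float(done), next_state, done))
        state = next_state
    return trajectory


if __name__ == "__main__":
    buffer = AugmentedReplayBuffer(capacity=100, synthetic_ratio=0.25)
    for i in range(10):
        buffer.add_real((float(i), 0.0, 0.0, float(i + 1), False))
    buffer.add_synthetic_trajectory(toy_generator(100.0, 5))
    print(len(buffer.sample(8, random.Random(0))))
```

Keeping the two pools separate makes the mixing ratio explicit; any replay-buffer-based learner that consumes batches of transitions could, under these assumptions, draw from such a buffer without other changes.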

Paper Structure

This paper contains 26 sections, 2 equations, 13 figures, 5 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Illustration and performance showcase of our ATraDiff. ATraDiff can seamlessly integrate with a wide range of RL methods and consistently improve their performance, by augmenting the replay buffer with synthesized trajectories. Top: Overview of online RL with ATraDiff. Bottom: Performance comparison of RL methods with and without ATraDiff in D4RL Kitchen.
  • Figure 2: Illustrative overview of our ATraDiff framework. Left: A diffuser containing multiple diffusion models, a length estimator, and a trajectory pruner. Right: Workflow of the online adaptation.
  • Figure 3: Learning curves of online RL on the D4RL Locomotion benchmark. ATraDiff (denoted as 'w/') consistently and significantly improves the performance of the two representative RL methods across all three environments, irrespective of whether basic or advanced algorithms are employed. ATraDiff also outperforms SynthER, which synthesizes individual transitions rather than full trajectories. These results validate the effectiveness and generalizability of our diffuser.
  • Figure 4: Learning curves of offline-to-online RL on the D4RL benchmark. ATraDiff (denoted as 'w/') further boosts the performance of advanced and recent offline-to-online RL baselines across all three environments, achieving state-of-the-art results, with the largest improvements in complex settings. This shows the importance of our online-adapted diffuser. The curve named "SynthER" shows the best performance of SynthER combined with any of the baselines.
  • Figure 5: Learning curves of offline-to-online RL on the Meta-World benchmark. Although the two tasks within the Meta-World environment are purposefully designed to be very challenging, with considerable distribution shifts, ATraDiff (denoted as 'w/') remains effective and significantly improves the performance of advanced and recent offline-to-online RL baselines. This further validates the strength of ATraDiff in tackling distribution shifts between offline data and online tasks. The curve named "SynthER" shows the best performance of SynthER combined with any of the baselines.
  • ...and 8 more figures
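
The Figure 2 caption names three components: multiple diffusion models, a length estimator, and a trajectory pruner. One plausible way they could compose into a coarse-to-precise length strategy is sketched below. The composition itself is an assumption made for illustration: the sketch supposes each model generates a fixed length, picks the smallest model that covers the estimated length, and then prunes by simple truncation. The function names, the toy generators, and the truncation rule are all hypothetical and do not describe the paper's actual components.

```python
from bisect import bisect_left
from typing import Callable, Dict, List, Sequence

Trajectory = List[float]  # toy trajectory: a list of scalar states


def pick_model_length(required: int, available: Sequence[int]) -> int:
    """Smallest available generation length >= required, else the largest."""
    lengths = sorted(available)
    if not lengths:
        raise ValueError("no generators available")
    idx = bisect_left(lengths, required)
    return lengths[min(idx, len(lengths) - 1)]


def generate(
    start_state: float,
    estimate_length: Callable[[float], int],
    generators: Dict[int, Callable[[float], Trajectory]],
    prune: Callable[[Trajectory, int], Trajectory],
) -> Trajectory:
    # Coarse step: choose a fixed-length model that covers the estimate.
    required = estimate_length(start_state)
    coarse = generators[pick_model_length(required, list(generators))](start_state)
    # Precise step: cut the coarse trajectory down to the estimated length.
    return prune(coarse, required)


def make_toy_generator(length: int) -> Callable[[float], Trajectory]:
    """Hypothetical stand-in for a diffusion model with a fixed output length."""
    def gen(start_state: float) -> Trajectory:
        return [start_state + t for t in range(length)]
    return gen


def truncate(trajectory: Trajectory, required: int) -> Trajectory:
    """Assumed pruning rule for this sketch: keep the first `required` steps."""
    return trajectory[:required]


if __name__ == "__main__":
    toy_models = {n: make_toy_generator(n) for n in (4, 8, 16)}
    trajectory = generate(0.0, lambda s: 7, toy_models, truncate)
    print(len(trajectory))
```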