AdaMemento: Adaptive Memory-Assisted Policy Optimization for Reinforcement Learning

Renye Yan, Yaozhong Gan, You Wu, Junliang Xing, Ling Liangn, Yeshang Zhu, Yimao Cai

TL;DR

AdaMemento is an adaptive memory-enhanced RL framework that exploits both positive and negative experiences by learning to predict known local optimal policies from real-time states. The authors also theoretically prove the superiority of its new intrinsic motivation and ensemble mechanism.

Abstract

In sparse reward scenarios of reinforcement learning (RL), the memory mechanism provides promising shortcuts to policy optimization by reflecting on past experiences, as humans do. However, current memory-based RL methods simply store and reuse high-value policies, lacking a deeper refining and filtering of diverse past experiences and hence limiting the capability of memory. In this paper, we propose AdaMemento, an adaptive memory-enhanced RL framework. Instead of just memorizing positive past experiences, we design a memory-reflection module that exploits both positive and negative experiences by learning to predict known local optimal policies based on real-time states. To effectively gather informative trajectories for the memory, we further introduce a fine-grained intrinsic motivation paradigm, where nuances in similar states can be precisely distinguished to guide exploration. The exploitation of past experiences and exploration of new policies are then adaptively coordinated by ensemble learning to approach the global optimum. Furthermore, we theoretically prove the superiority of our new intrinsic motivation and ensemble mechanism. From 59 quantitative and visualization experiments, we confirm that AdaMemento can distinguish subtle states for better exploration and effectively exploit past experiences in memory, achieving significant improvement over previous methods.

Paper Structure

This paper contains 48 sections, 2 theorems, 19 equations, 15 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.2

After $k$ updates of the coarse-fine distinction network, under Assumption assu1, the optimal action remains the same after adding the intrinsic rewards. That is, for any state $s$, the action that maximizes $Q^*_1(s, \cdot)$ is also optimal under the original rewards, where $Q^*_1$ is the optimal $Q$ function after adding the intrinsic rewards.
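
The claim of Theorem 3.2 can be sketched as an equality of greedy actions. This is a hedged restatement of the prose statement, not the paper's exact display: the symbol $Q^*$ for the optimal $Q$ function before the intrinsic rewards are added, and the action set $\mathcal{A}$, are notation assumed here for illustration.

```latex
% Sketch only: Q^* (optimal Q function without intrinsic rewards) and
% \mathcal{A} (action set) are assumed symbols, not taken from the source.
\arg\max_{a \in \mathcal{A}} Q^*_1(s, a)
  \;=\;
\arg\max_{a \in \mathcal{A}} Q^*(s, a)
\qquad \text{for every state } s
```

Read this way, the theorem constrains only which action ranks first in each state; it does not say that the values of $Q^*_1$ and $Q^*$ coincide.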

Figures (15)

  • Figure 1: Different granularity of state discrimination. (a) versus (b) represents fine-grained distinction, where the state images look similar but are of completely different importance, which is not well addressed in previous research.
  • Figure 2: The left side shows past experience trajectories stored in the memory buffer. By synthesizing and reflecting on the commonalities in these trajectories, AdaMemento learns to avoid danger and keeps updating the current optimal strategy, which is illustrated on the right side.
  • Figure 3: AdaMemento's framework. We evaluate each sub-module in (a) and parameters in (b) and (c).
  • Figure 4: Comparison in the Montezuma's Revenge environment. (a) compares baseline methods before and after integration with AdaMemento; (b) presents a performance comparison with other advanced baseline models.
  • Figure 5: Generalization experiments in discrete-space environments (Atari). The x-axis shows timesteps in units of 10 million.
  • ...and 10 more figures

Theorems & Definitions (4)

  • Theorem 3.2
  • Theorem 3.3
  • proof
  • proof