
Revisiting Experience Replayable Conditions

Taisuke Kobayashi

TL;DR

It is confirmed through numerical simulations that the proposed stabilization tricks make ER applicable to an advantage actor-critic, an on-policy algorithm, and its learning performance is comparable to that of a soft actor-critic, a state-of-the-art off-policy algorithm.

Abstract

Experience replay (ER) used in (deep) reinforcement learning is considered to be applicable only to off-policy algorithms. However, there have been some cases in which ER has been applied to on-policy algorithms, suggesting that off-policyness might be a sufficient, rather than necessary, condition for applying ER. This paper reconsiders stricter "experience replayable conditions" (ERC) and proposes a way of modifying existing algorithms to satisfy ERC. To this end, it is postulated that the instability of policy improvements is a pivotal factor in ERC. The instability factors are revealed from the viewpoint of metric learning as i) repulsive forces from negative samples and ii) replays of inappropriate experiences. Accordingly, the corresponding stabilization tricks are derived. Numerical simulations confirm that the proposed stabilization tricks make ER applicable to an advantage actor-critic, an on-policy algorithm. Moreover, its learning performance is comparable to that of a soft actor-critic, a state-of-the-art off-policy algorithm.

Paper Structure (25 sections, 18 equations, 9 figures, 1 table)


Figures (9)

  • Figure 1: Concept of two stabilization tricks for satisfying ERC: Under the hypothesis that different algorithms have different sets of acceptable empirical data, the first trick, counteraction, expands that set, while the second trick, mining, selects the empirical data to be replayed.
  • Figure 2: Desired distance relationship in metric learning with triplet loss: The anchor data $x$ should be in the same cluster as the positive data $x^+$, and the negative data $x^-$ should be taken from a different cluster away from it.
  • Figure 3: Two stabilization tricks: According to the judgements of the experience discriminator, (a) counteraction applies a regularization to $\pi$ to increase the misidentification rate to $\pi \simeq b$; and (b) mining masks empirical data judged to be $\pi \neq b$ and excludes them from replay/training.
  • Figure 4: Grid search of $(\eta_C, \eta_M)$ on DoublePendulum: $\eta_C, \eta_M \in \{0.1, 0.5, 1.0, 5.0, 10.0\}$ are searched coarsely in terms of return from the environment, and the $(\eta_C, \eta_M)$ used later is then set to $(0.5, 2.0)$ with reference to the grid search results.
  • Figure 5: Returns of ablation tests: Adding the two stabilization tricks enables learning with an on-policy algorithm, A2C, combined with ER.
  • ...and 4 more figures
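
The distance relationship described for Figure 2 can be illustrated with the standard triplet loss from metric learning. The sketch below is a minimal illustration, not the paper's implementation: the function names, the Euclidean distance, and the margin value are assumptions chosen for the example. The loss is zero once the negative sample is farther from the anchor than the positive sample by at least the margin; otherwise the positive sample is pulled in and the negative sample is pushed away, the latter being the kind of repulsive force the abstract refers to.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(x, x+) - d(x, x-) + margin).

    `margin` is a hypothetical hyperparameter chosen for illustration.
    """
    d_pos = euclidean(anchor, positive)  # attraction term: should be small
    d_neg = euclidean(anchor, negative)  # repulsion term: should be large
    return max(0.0, d_pos - d_neg + margin)


# Negative already far enough from the anchor: no loss, no repulsive force.
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])

# Negative closer than the positive: the loss is active.
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

In the first call the distances are 1 and 5, so the hinge is inactive; in the second they are swapped, so the loss is positive and both the attraction and repulsion terms contribute.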