Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps

Benjamin Ellis, Matthew T. Jackson, Andrei Lupu, Alexander D. Goldie, Mattie Fellows, Shimon Whiteson, Jakob Foerster

TL;DR

This work addresses nonstationarity in reinforcement learning by examining how the Adam optimizer responds to abrupt gradient changes and proposing Adam-Rel, which resets Adam's timestep at each new objective while preserving the momentum estimates. The analysis gives theoretical intuition: Adam-Rel bounds update sizes in large-gradient regimes and effectively performs learning-rate annealing when gradient changes are modest. Empirically, it improves performance in both on-policy (PPO on Craftax-Classic and Atari-57) and off-policy (DQN on Atari-10) settings. Key contributions include a formal analysis of gradient-scale effects on Adam, a simple one-line modification that implements Adam-Rel, and extensive experiments showing improved performance and robustness over Adam and Adam-MR. The results suggest that a lightweight, optimizer-centered approach to nonstationarity can yield substantial practical benefits in RL.

Abstract

In reinforcement learning (RL), it is common to apply techniques used broadly in machine learning such as neural network function approximators and momentum-based optimizers. However, such tools were largely developed for supervised learning rather than nonstationary RL, leading practitioners to adopt target networks, clipped policy updates, and other RL-specific implementation tricks to combat this mismatch, rather than directly adapting this toolchain for use in RL. In this paper, we take a different approach and instead address the effect of nonstationarity by adapting the widely used Adam optimiser. We first analyse the impact of nonstationary gradient magnitude -- such as that caused by a change in target network -- on Adam's update size, demonstrating that such a change can lead to large updates and hence sub-optimal performance. To address this, we introduce Adam-Rel. Rather than using the global timestep in the Adam update, Adam-Rel uses the local timestep within an epoch, essentially resetting Adam's timestep to 0 after target changes. We demonstrate that this avoids large updates and reduces to learning rate annealing in the absence of such increases in gradient magnitude. Evaluating Adam-Rel in both on-policy and off-policy RL, we demonstrate improved performance in both Atari and Craftax. We then show that increases in gradient norm occur in RL in practice, and examine the differences between our theoretical model and the observed data.
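The timestep reset described in the abstract can be illustrated with a minimal scalar sketch. This is not the authors' implementation: the class name `AdamRelSketch`, the method `start_new_objective`, and the default hyperparameters are assumptions chosen for illustration. The only difference from a textbook Adam step is that the timestep used for bias correction is set back to 0 when the objective changes, while the moment estimates `m` and `v` are kept.

```python
import math


class AdamRelSketch:
    """Scalar Adam with a resettable (local) timestep. Illustrative only."""

    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = 0.0  # first-moment estimate, kept across objectives
        self.v = 0.0  # second-moment estimate, kept across objectives
        self.t = 0    # local timestep within the current objective

    def start_new_objective(self):
        # Hypothetical hook: call when the target network or rollout data changes.
        # Only the timestep is reset; m and v are left untouched.
        self.t = 0

    def step(self, param, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        # Bias correction uses the local timestep, not a global step count.
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return param - self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
```

In an RL loop, one would presumably call `start_new_objective()` after each target-network update (DQN) or each new batch of rollouts (PPO), and `step` for every gradient update in between.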

Paper Structure

This paper contains 31 sections, 1 theorem, 12 equations, 7 figures, 6 tables, 1 algorithm.

Key Result

Theorem 3.1

Assume that $\epsilon=0$. Let $g_t^i$ be defined as in eq:gradient_step and $\hat{m}_{-t',t}^i$ and $\hat{v}_{-t',t}^i$ be the momentum terms at timestep $t$ given Adam starts at timestep $-t'$. It follows that:

Figures (7)

  • Figure 1: Update size of Adam and Adam-Rel versus $k$ when considering nonstationary gradients. Assumes that optimization starts at time $-t'$, which is large, and that the gradients up until time $0$ are $g$ and then there is an increase in the gradient to $kg$.
  • Figure 2: Performance of Adam-Rel, Adam, Adam-MR, and Adam ($\beta_1 = \beta_2$) for PPO and Adam, Adam-MR and Adam-Rel for DQN on Atari-57 and Atari-10 respectively. Atari-10 uses a subset of Atari tasks to estimate median performance across the whole suite. Details can be found in [aitchison2023atari]. Error bars are 95% stratified bootstrapped confidence intervals. Results are across 10 seeds except for Adam ($\beta_1 = \beta_2$), which is 3 seeds.
  • Figure 3: PPO on Craftax-1B --- comparison of Adam-Rel against Adam, Adam-MR, and Adam with $\beta_1 = \beta_2$ [dohare2023overcoming]. Bars show the 95% stratified bootstrap confidence interval, with mean marked, over 8 seeds [agarwal2021deep].
  • Figure 4: Performance Profile of Adam and Adam-Rel on Atari-57. Error bars represent the standard error across 10 seeds. Green-shaded areas represent Adam-Rel outperforming Adam and red-shaded areas the opposite.
  • Figure 5: Adam and Adam-Rel compared to the theoretical model. To make this plot, we divided all the updates in the PPO run into chunks, each of which was optimising a stationary objective. We then averaged over all the chunks. The red dashed lines show the different epochs for each batch of data. The assumption about the gradient under the model is shown in the grad norm plot. Note that the update norm plot for Adam and Adam-Rel has separate y-axes. The shading represents standard error.
  • ...and 2 more figures
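The scenario in the Figure 1 caption (a long run of gradients $g$ followed by a jump to $kg$) can be explored numerically with a small simulation. The sketch below is an illustration under stated assumptions, not a reproduction of the figure: it uses a scalar gradient, $\epsilon = 0$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and a finite warm-up of 20000 steps standing in for a large $t'$. The function name `update_sizes` and its parameters are hypothetical. It returns the update size (the update divided by the learning rate) for each step after the jump, either with the global timestep (Adam) or with the timestep reset at the jump (Adam-Rel).

```python
import math


def update_sizes(k, reset_timestep, warmup_steps=20000, horizon=50,
                 beta1=0.9, beta2=0.999, g=1.0):
    """Update sizes |m_hat| / sqrt(v_hat) after the gradient jumps from g to k*g."""
    m = v = 0.0
    t = 0
    # Warm-up phase: constant gradient g, standing in for a large t'.
    for _ in range(warmup_steps):
        t += 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    if reset_timestep:
        t = 0  # Adam-Rel: restart the timestep, keep m and v
    sizes = []
    # After the change of objective: constant gradient k * g.
    for _ in range(horizon):
        t += 1
        grad = k * g
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        sizes.append(abs(m_hat) / math.sqrt(v_hat))  # eps = 0 as in the analysis
    return sizes


adam = update_sizes(k=100.0, reset_timestep=False)
adam_rel = update_sizes(k=100.0, reset_timestep=True)
print(f"max update size, global timestep: {max(adam):.2f}")
print(f"max update size, reset timestep:  {max(adam_rel):.2f}")
```

Sweeping `k` over a range of values and plotting the maximum of each list against `k` gives a rough counterpart to the comparison the caption describes.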

Theorems & Definitions (2)

  • Theorem 3.1
  • proof