Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach

Henrique Donâncio, Antoine Barrier, Leah F. South, Florence Forbes

TL;DR

This work tackles the challenge of selecting an appropriate learning rate in deep reinforcement learning, which is hindered by non-stationary objectives. It introduces LRRL, a meta-learning approach that casts learning-rate adaptation as an adversarial multi-armed bandit problem, selecting among a small candidate set based on policy performance and employing time-decayed Exp3 updates. Empirically, LRRL achieves competitive or superior results compared to fixed rates and standard schedulers across Atari and MuJoCo tasks, and it remains robust when some candidates are poor; it also generalizes to stationary non-convex optimization with SGD by dynamically combining rates. Overall, LRRL offers a practical, algorithm-agnostic mechanism to adapt learning dynamics in non-stationary deep RL, reducing hyperparameter tuning while maintaining strong performance.

Abstract

In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps. LRRL adaptively favors rates that improve returns, remaining robust even when the candidate set includes values that individually cause divergence. Across Atari and MuJoCo benchmarks, LRRL achieves performance competitive with or superior to tuned baselines and standard schedulers. Our findings position LRRL as a practical solution for adapting to non-stationary objectives in deep RL.
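
To make the bandit formulation from the TL;DR concrete, here is a minimal sketch of Exp3 choosing among a small set of candidate learning rates. It is an illustration under stated assumptions, not the paper's algorithm. The class name `Exp3LearningRateSelector`, the values of `gamma` and `decay`, the multiplicative discount on past log-weights used as the "time-decayed" update, and the rescaling of returns to [0, 1] are all hypothetical choices made for this example. The summary above does not specify the exact LRRL update rule.

```python
import math
import random


class Exp3LearningRateSelector:
    """Illustrative Exp3 bandit over a fixed set of candidate learning rates."""

    def __init__(self, learning_rates, gamma=0.1, decay=0.99, seed=0):
        self.learning_rates = list(learning_rates)
        self.gamma = gamma  # exploration mixing coefficient (assumed value)
        self.decay = decay  # discount on past log-weights (assumed form of time decay)
        self.log_weights = [0.0] * len(self.learning_rates)
        self.rng = random.Random(seed)

    def probabilities(self):
        # Softmax over log-weights (shifted by the max for numerical stability),
        # mixed with the uniform distribution so every arm keeps being explored.
        shift = max(self.log_weights)
        exp_w = [math.exp(w - shift) for w in self.log_weights]
        total = sum(exp_w)
        k = len(exp_w)
        return [(1 - self.gamma) * e / total + self.gamma / k for e in exp_w]

    def select(self):
        # Sample an arm index and return it with its learning rate.
        probs = self.probabilities()
        arm = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return arm, self.learning_rates[arm]

    def update(self, arm, reward):
        # `reward` is assumed to be policy performance rescaled to [0, 1].
        probs = self.probabilities()
        estimate = reward / probs[arm]  # importance-weighted reward estimate
        k = len(probs)
        # Discount old evidence so the selector can track a shifting best arm.
        self.log_weights = [self.decay * w for w in self.log_weights]
        self.log_weights[arm] += self.gamma * estimate / k


# Toy feedback loop: the middle learning rate is made to look best.
selector = Exp3LearningRateSelector([1e-3, 1e-4, 1e-5])
for _ in range(500):
    arm, lr = selector.select()
    # In deep RL, the agent would train with `lr` here and report its return.
    reward = 0.9 if arm == 1 else 0.1
    selector.update(arm, reward)

final_probs = selector.probabilities()
```

In this toy loop the reward is a fixed function of the chosen arm. In the deep RL setting described above, the feedback would come from policy performance after training with the selected rate. Discounting the log-weights lets the selector shift toward a different rate if the best one changes during training.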

Paper Structure

This paper contains 32 sections, 9 equations, 14 figures, 3 tables, 2 algorithms.

Figures (14)

  • Figure 1: LRRL and standard DQN comparison: 5 variants of LRRL with different learning rate sets are tested against the DQN algorithm reaching best performance among the possible learning rates (provided in Appendix \ref{fig:dqn_baseline_adam}). The mean and one-half standard deviations over 5 runs are represented.
  • Figure 2: Systematic sampling of normalized learning rates and returns over the training steps using LRRL $\mathcal{K}(5)$ with Adam optimizer, through a single run. For each episode, we show the selected learning rate using different colors. Lower rates are increasingly selected over time.
  • Figure 3: LRRL vs 3 individual schedulers ($d=1,2,3$). The dashed black line represents the max average return achieved by Adam with a constant learning rate set to $6.25 \times 10^{-5}$.
  • Figure 4: Results of the ablation study evaluating the impact of varying $\kappa$ with LRRL using schedulers. The dashed black line represents the max average return achieved by the best performing scheduler.
  • Figure 5: Comparison of LRRL, AdamRel, and CLR on MuJoCo tasks using PPO. LRRL achieves higher average returns in two tasks. Lines represent the mean and shaded regions show one-half standard deviations over 10 seeds, with training run for 1M frames.
  • ...and 9 more figures