Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach
Henrique Donâncio, Antoine Barrier, Leah F. South, Florence Forbes
TL;DR
This work tackles the challenge of selecting an appropriate learning rate in deep reinforcement learning, a choice made difficult by non-stationary objectives. It introduces LRRL, a meta-learning approach that casts learning-rate adaptation as an adversarial multi-armed bandit problem: it selects among a small candidate set based on policy performance and uses time-decayed Exp3 updates. Empirically, LRRL achieves results competitive with or superior to fixed rates and standard schedulers across Atari and MuJoCo tasks, and it remains robust when some candidates are poor. It also generalizes to stationary non-convex optimization with SGD by dynamically combining rates. Overall, LRRL offers a practical, algorithm-agnostic mechanism to adapt learning dynamics in non-stationary deep RL, reducing hyperparameter tuning while maintaining strong performance.
Abstract
In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps. LRRL adaptively favors rates that improve returns, remaining robust even when the candidate set includes values that individually cause divergence. Across Atari and MuJoCo benchmarks, LRRL achieves performance competitive with or superior to tuned baselines and standard schedulers. Our findings position LRRL as a practical solution for adapting to non-stationary objectives in deep RL.
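The summary above does not spell out the update rule, so the following is only a minimal sketch of how an Exp3-style bandit with time-decayed weights could choose among candidate learning rates. The class name `Exp3LearningRateSelector`, the parameters `step_size` and `decay`, and the assumption that the performance feedback is rescaled to [0, 1] are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import math
import random


class Exp3LearningRateSelector:
    """Illustrative Exp3-style selector over candidate learning rates.

    Hypothetical sketch: names, defaults, and the exact form of the
    decay are assumptions, not the authors' implementation.
    """

    def __init__(self, candidates, step_size=0.2, decay=0.9, seed=0):
        self.candidates = list(candidates)
        self.step_size = step_size  # bandit step size (assumed name)
        self.decay = decay          # decay in (0, 1]; 1.0 disables forgetting
        self.weights = [0.0] * len(self.candidates)  # log-domain weights
        self.rng = random.Random(seed)

    def probabilities(self):
        # Softmax over the log-weights, shifted by the max for stability.
        m = max(self.weights)
        exps = [math.exp(w - m) for w in self.weights]
        total = sum(exps)
        return [e / total for e in exps]

    def select(self):
        # Sample an arm (a candidate learning rate) from the current distribution.
        probs = self.probabilities()
        arm = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return arm, self.candidates[arm]

    def update(self, arm, reward):
        # `reward` is assumed to be a performance signal rescaled to [0, 1],
        # for example a normalized improvement in average return.
        probs = self.probabilities()
        # Decay all past evidence so that recent feedback dominates,
        # which matters when the objective is non-stationary.
        self.weights = [self.decay * w for w in self.weights]
        # Importance-weighted estimate: only the pulled arm is credited.
        self.weights[arm] += self.step_size * reward / probs[arm]
```

In a training loop, one would call `select()` at the start of each bandit round, train the agent with the returned learning rate for a fixed number of steps, and then pass the observed (rescaled) performance to `update()`. With `decay` below 1, an arm that was useful early in training loses its advantage unless it keeps producing good feedback, which is the property that lets the selector track a moving optimum.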
