
Learning Rate-Free Reinforcement Learning: A Case for Model Selection with Non-Stationary Objectives

Aida Afshar, Aldo Pacchiano

TL;DR

The paper addresses the sensitivity of RL to hyperparameters by introducing a model-selection framework that tunes the learning rate online using only reward feedback, yielding a learning-rate-free setup that can wrap any RL algorithm. It formalizes a meta-learning architecture with $m$ base agents, each with its own learning rate $\alpha^i$ and policy $\pi^i$, and a meta-learner that selects among them. The data-driven regret-balancing methods D$^3$RB and ED$^2$RB outperform bandit baselines in non-stationary settings, with regret bounded near $O(\sqrt{N})$. Experiments with PPO on the Humanoid task show improved convergence and less manual hyperparameter tuning. The work also suggests extensions to tuning multiple hyperparameters and sharing data across base agents.

Abstract

The performance of reinforcement learning (RL) algorithms is sensitive to the choice of hyperparameters, with the learning rate being particularly influential. RL algorithms fail to reach convergence or demand an extensive number of samples when the learning rate is not optimally set. In this work, we show that model selection can help to mitigate the failure modes of RL that are due to suboptimal choices of learning rate. We present a model selection framework for Learning Rate-Free Reinforcement Learning that employs model selection methods to select the optimal learning rate on the fly. This approach of adaptive learning rate tuning depends on neither the underlying RL algorithm nor the optimizer and solely uses the reward feedback to select the learning rate; hence, the framework can take any RL algorithm as input and produce a learning rate-free version of it. We conduct experiments for policy optimization methods and evaluate various model selection strategies within our framework. Our results indicate that data-driven model selection algorithms are better alternatives to standard bandit algorithms when the optimal choice of hyperparameter is time-dependent and non-stationary.
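
The framework described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `ToyBaseAgent` is a hypothetical stand-in for an RL agent such as PPO, its reward model is invented, and `RegretBalancingMetaLearner` is a simplified regret-balancing rule in the spirit of D$^3$RB, not the exact procedure from the paper. The confidence width and the constants `c`, `delta`, and `d_min` are likewise illustrative choices. The sketch only shows the overall shape of the loop: the meta-learner picks a base agent, that agent alone plays and updates, and the observed reward is the only feedback used for selection.

```python
import math
import random


class ToyBaseAgent:
    """Hypothetical stand-in for an RL agent trained with one fixed learning rate.

    Its expected reward grows each time it is played and updated, which makes
    the reward signal non-stationary from the meta-learner's point of view.
    """

    def __init__(self, learning_rate, good_lr=3e-4, seed=0):
        self.learning_rate = learning_rate
        self.skill = 0.0
        # Invented reward model: progress per update shrinks as the learning
        # rate moves away (in log scale) from an arbitrary "good" value.
        gap = abs(math.log10(learning_rate) - math.log10(good_lr))
        self.progress = 0.02 / (1.0 + 4.0 * gap)
        self.rng = random.Random(seed)

    def play_and_update(self):
        """Run one round, improve the policy, and return a reward in [0, 1]."""
        mean = 1.0 - math.exp(-self.skill)
        self.skill += self.progress
        noisy = mean + self.rng.gauss(0.0, 0.05)
        return min(1.0, max(0.0, noisy))


class RegretBalancingMetaLearner:
    """Simplified regret-balancing selection rule (illustrative, not the paper's)."""

    def __init__(self, num_base, delta=0.05, c=1.0, d_min=1.0):
        self.m = num_base
        self.delta = delta
        self.c = c
        self.plays = [0] * num_base            # n^i: times base i was played
        self.total_reward = [0.0] * num_base   # u^i: cumulative reward of base i
        self.d_hat = [d_min] * num_base        # estimated regret coefficient

    def select(self):
        # Play the base whose estimated regret bound d_hat * sqrt(n) is smallest.
        return min(range(self.m),
                   key=lambda i: self.d_hat[i] * math.sqrt(self.plays[i]))

    def _width(self, i):
        # Assumed confidence width for the average reward of base i.
        n = self.plays[i]
        return self.c * math.sqrt(math.log(self.m * n / self.delta) / n)

    def update(self, i, reward):
        self.plays[i] += 1
        self.total_reward[i] += reward
        played = [j for j in range(self.m) if self.plays[j] > 0]
        best_lower = max(self.total_reward[j] / self.plays[j] - self._width(j)
                         for j in played)
        n = self.plays[i]
        optimistic = (self.total_reward[i] / n
                      + self.d_hat[i] / math.sqrt(n)
                      + self._width(i))
        # If even the optimistic value of base i falls below another base's
        # pessimistic value, its regret estimate was too small: double it.
        if optimistic < best_lower:
            self.d_hat[i] *= 2.0


def run(learning_rates, num_rounds=3000, seed=0):
    """Wrap one toy agent per learning rate and let the meta-learner choose."""
    agents = [ToyBaseAgent(lr, seed=seed + k)
              for k, lr in enumerate(learning_rates)]
    meta = RegretBalancingMetaLearner(len(agents))
    for _ in range(num_rounds):
        i = meta.select()
        reward = agents[i].play_and_update()  # only the selected agent learns
        meta.update(i, reward)
    return meta


if __name__ == "__main__":
    rates = [1e-5, 1e-4, 3e-4, 1e-3, 1e-2]
    result = run(rates)
    for lr, n in zip(rates, result.plays):
        print(f"learning rate {lr:g}: played {n} times")
```

Because the meta-learner only calls `play_and_update` and reads back a scalar reward, the same loop could in principle wrap any agent that exposes those two things, which is the sense in which the selection is independent of the underlying algorithm and optimizer.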

Paper Structure (6 sections, 6 equations, 3 figures, 8 algorithms)

Figures (3)

  • Figure 1: Learning Rate-Free PPO on the Humanoid environment. Each curve shows the mean and standard deviation of normalized reward per step over three seeds.
  • Figure 2: Selection frequency of each base learner for Learning Rate-Free PPO on the Humanoid environment. The y-axis indicates the base learner's index, and the x-axis indicates the timestep. Each (x,y) point shows that base learner y was selected and played by the meta-learner at time x.
  • Figure 3: (Left) Number of times D$^3$RB played each learning rate. (Right) Maximum reward of PPO agents initialized with the same set of learning rates. D$^3$RB plays base learners with higher rewards more frequently than base learners with suboptimal performance.
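
As a small illustration of the bookkeeping behind Figures 2 and 3, the sketch below turns a log of selected base-learner indices into per-learning-rate play counts. The function name `selection_counts`, the example log, and the learning rates are hypothetical; the paper's plotting code is not shown in the source.

```python
from collections import Counter


def selection_counts(selections, learning_rates):
    """Count how often each learning rate was played.

    `selections[t]` is the index of the base learner chosen at timestep t,
    i.e. the y-value of the point plotted at x = t in a selection plot.
    """
    counts = Counter(selections)
    return {learning_rates[i]: counts.get(i, 0)
            for i in range(len(learning_rates))}


if __name__ == "__main__":
    log = [0, 1, 1, 2, 1]            # hypothetical selection log
    rates = [1e-5, 3e-4, 1e-2]       # hypothetical learning rates
    print(selection_counts(log, rates))
```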