Efficient Restarts in Non-Stationary Model-Free Reinforcement Learning

Hiroshi Nonaka, Simon Ambrozak, Sofia R. Miskala-Dinc, Amedeo Ercole, Aviva Prins

TL;DR

This work tackles non-stationary model-free reinforcement learning by improving the RestartQ-UCB framework through three restart paradigms: partial, adaptive, and selective restarts. Partial restarts tighten the post-reset upper bounds on $Q$-values to preserve useful information, adaptive restarts trigger restarts based on observed reward dynamics, and selective restarts update only a targeted subset of the $Q$-table along observed trajectories. Empirically, these methods reduce dynamic regret by up to $74\%$ on RandomMDP and up to $91\%$ on BDCL, while preserving near-optimal early performance and adding only modest computational overhead. The results indicate that outer restart wrappers can substantially improve the practical performance of theoretically robust, model-free RL in non-stationary environments, narrowing the gap between theory and practice. Potential future work includes deriving formal guarantees for adaptive and selective restarts, handling unknown budgets, and applying these wrappers to a wider class of stationary algorithms.
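To make the difference between the restart paradigms concrete, the following is a minimal sketch rather than the paper's update rules. It assumes rewards in $[0, 1]$, a table of optimistic upper bounds indexed by (step, state, action), and a hypothetical `slack` parameter that loosens learned bounds instead of discarding them; the function names are illustrative as well.

```python
import numpy as np


def full_restart(q_upper: np.ndarray) -> np.ndarray:
    """Forget everything: reset each bound to the largest possible return from step h."""
    horizon = q_upper.shape[0]
    remaining = horizon - np.arange(horizon)  # H - h for 0-indexed step h
    return np.broadcast_to(remaining[:, None, None], q_upper.shape).astype(float)


def partial_restart(q_upper: np.ndarray, slack: float) -> np.ndarray:
    """Assumed rule: loosen learned bounds by `slack`, capped at the full-restart value."""
    return np.minimum(full_restart(q_upper), q_upper + slack)


def selective_restart(q_upper: np.ndarray, trajectory, slack: float) -> np.ndarray:
    """Assumed rule: apply the partial restart only to visited (h, s, a) entries."""
    out = q_upper.copy()
    fresh = partial_restart(q_upper, slack)
    for h, s, a in trajectory:
        out[h, s, a] = fresh[h, s, a]
    return out
```

In this sketch a full restart returns every entry to its maximal value, a partial restart keeps any learned bound that is still tighter than that maximum after adding `slack`, and a selective restart touches only the entries along the given trajectory.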

Abstract

In this work, we propose three efficient restart paradigms for model-free non-stationary reinforcement learning (RL). We identify two core issues with the restart design of Mao et al. (2022)'s RestartQ-UCB algorithm: (1) complete forgetting, where all the information learned about an environment is lost after a restart, and (2) scheduled restarts, in which restarts occur only at predefined times, regardless of how poorly the policy fits the current environment dynamics. We introduce three approaches, which we call partial, adaptive, and selective restarts, to modify the algorithms RestartQ-UCB and RANDOMIZEDQ (Wang et al., 2025). We find near-optimal empirical performance in multiple environments, decreasing dynamic regret by up to $91\%$ relative to RestartQ-UCB.
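The contrast between scheduled and adaptive restarts can be illustrated with a small detector that watches episode returns. This is a hedged sketch, not the trigger used in the paper: the class name `RewardDropDetector`, the sliding-window mean, and the `window` and `threshold` parameters are all assumptions chosen for illustration.

```python
from collections import deque


class RewardDropDetector:
    """Hypothetical trigger: signal a restart when recent returns fall well below a reference."""

    def __init__(self, window: int, threshold: float):
        self.window = window
        self.threshold = threshold
        self.recent = deque(maxlen=window)
        self.reference = None  # best windowed mean seen since the last restart

    def update(self, episode_return: float) -> bool:
        """Record one episode return; return True if a restart should be triggered."""
        self.recent.append(episode_return)
        if len(self.recent) < self.window:
            return False  # not enough evidence yet
        mean = sum(self.recent) / self.window
        if self.reference is None or mean > self.reference:
            self.reference = mean
        if self.reference - mean > self.threshold:
            # Forget the statistics gathered under the old dynamics.
            self.recent.clear()
            self.reference = None
            return True
        return False
```

A scheduled scheme would instead restart whenever the episode index is a multiple of a fixed period, whether or not the returns have changed; the detector above fires only after the windowed mean has dropped by more than `threshold`.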

Paper Structure

This paper contains 25 sections, 27 equations, 3 figures, 4 algorithms.

Figures (3)

  • Figure 1: This figure compares the impact of a partial restart (red) as opposed to a full restart (orange), when both are positioned to align with an abrupt change in BDCL. After the abrupt change and restarts at episode 1001, partial restarts allow for much faster learning than full restarts.
  • Figure 2: This figure demonstrates that adaptive restarts (blue) perform better than scheduled restarts (orange) in BDCL. On the left are adaptive restarts and scheduled restarts, showing that adaptive restarts only occur after each abrupt change and achieve a higher cumulative reward. This effect is further shown on the right, where RestartQ-UCB with adaptive and partial restarts (purple) receives nearly twice as much total reward as scheduled, full restarts.
  • Figure 3: From left to right, the plots correspond to RandomMDP, abrupt BDCL, and gradual BDCL. In RandomMDP, RestartQ-UCB with adaptive and partial restarts achieves near-optimal total reward and improves markedly over base RestartQ-UCB. In the BDCL environments, both RestartQ-UCB with adaptive and partial restarts and SelectiveRANDOMIZEDQ outperform base RestartQ-UCB. Notably, SelectiveRANDOMIZEDQ has near-zero dynamic regret in abrupt BDCL at episode 7,500, showing the promise of this approach.
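The figures report total reward and dynamic regret. The excerpt does not state the paper's exact definition, so the sketch below assumes the common one: the running sum, over episodes, of the gap between the optimal value under that episode's dynamics and the return the agent actually achieved. The function names and inputs are illustrative.

```python
import numpy as np


def cumulative_dynamic_regret(optimal_values, achieved_returns) -> np.ndarray:
    """Running sum of per-episode gaps between optimal and achieved return."""
    gaps = np.asarray(optimal_values, dtype=float) - np.asarray(achieved_returns, dtype=float)
    return np.cumsum(gaps)


def relative_reduction(baseline_regret: float, new_regret: float) -> float:
    """Fraction by which `new_regret` improves on `baseline_regret`."""
    return 1.0 - new_regret / baseline_regret
```

Under this reading, a reported reduction of $91\%$ means the final cumulative regret of the modified algorithm is about $9\%$ of the baseline's.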

Theorems & Definitions (1)

  • proof