Efficient Restarts in Non-Stationary Model-Free Reinforcement Learning
Hiroshi Nonaka, Simon Ambrozak, Sofia R. Miskala-Dinc, Amedeo Ercole, Aviva Prins
TL;DR
This work tackles non-stationary model-free reinforcement learning by improving the RestartQ-UCB framework with three restart paradigms: partial, adaptive, and selective restarts. Partial restarts tighten the post-reset upper bounds on $Q$-values to preserve useful information, adaptive restarts trigger a restart based on observed reward dynamics, and selective restarts update only a targeted subset of the $Q$-table along observed trajectories. Empirically, these methods yield large improvements in dynamic regret on RandomMDP and BDCL, with reductions of up to $74\%$ and $91\%$ respectively, while preserving near-optimal early performance and adding only modest computational overhead. The results show that outer restart wrappers can substantially improve the practical performance of theoretically robust, model-free RL in non-stationary environments. Potential future work includes deriving formal guarantees for adaptive and selective restarts, handling unknown budgets, and applying these wrappers to a wider class of stationary algorithms.
Abstract
In this work, we propose three efficient restart paradigms for model-free non-stationary reinforcement learning (RL). We identify two core issues with the restart design of the RestartQ-UCB algorithm (Mao et al., 2022): (1) complete forgetting, where all information learned about an environment is lost after a restart, and (2) scheduled restarts, in which restarts occur only at predefined times, regardless of how poorly the current policy matches the environment dynamics. We introduce three approaches, which we call partial, adaptive, and selective restarts, to modify the algorithms RestartQ-UCB and RANDOMIZEDQ (Wang et al., 2025). We find near-optimal empirical performance in multiple environments, decreasing dynamic regret by up to $91\%$ relative to RestartQ-UCB.
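To make the two issues concrete, the following is a minimal illustrative sketch and not the algorithms evaluated in this work. It contrasts a full reset (complete forgetting) with a hypothetical partial reset that caps each optimistic $Q$-value at its old value plus a slack term, and it replaces a fixed restart schedule with a hypothetical reward-based trigger. The names `full_reset`, `partial_reset`, `AdaptiveRestartTrigger`, and the parameters `slack`, `window`, and `threshold` are assumptions introduced here for illustration only.

```python
from collections import deque


def full_reset(q_values, optimistic_bound):
    """Complete forgetting: every entry returns to the optimistic bound."""
    return [optimistic_bound for _ in q_values]


def partial_reset(q_values, optimistic_bound, slack):
    """Hypothetical partial restart: keep the old estimate plus a slack term,
    never exceeding the optimistic bound. The slack rule is an assumption."""
    return [min(optimistic_bound, q + slack) for q in q_values]


class AdaptiveRestartTrigger:
    """Hypothetical trigger: request a restart when the mean reward over the
    last `window` episodes falls more than `threshold` below the mean reward
    since the previous restart."""

    def __init__(self, window=5, threshold=0.3):
        self.window = window
        self.threshold = threshold
        self.recent = deque(maxlen=window)
        self.total = 0.0
        self.count = 0

    def update(self, episode_reward):
        """Record one episode reward; return True if a restart is requested."""
        self.recent.append(episode_reward)
        self.total += episode_reward
        self.count += 1
        if len(self.recent) < self.window:
            return False  # not enough data since the last restart
        baseline = self.total / self.count
        recent_mean = sum(self.recent) / len(self.recent)
        if baseline - recent_mean > self.threshold:
            # Forget the reward statistics from before the detected change.
            self.recent.clear()
            self.total = 0.0
            self.count = 0
            return True
        return False


# Toy reward stream: the environment changes after episode 10.
rewards = [1.0] * 10 + [0.0] * 5
trigger = AdaptiveRestartTrigger(window=5, threshold=0.3)
restart_episodes = [t for t, r in enumerate(rewards) if trigger.update(r)]
```

In this toy stream the trigger fires only after the reward drop has shown up in the recent window, whereas a scheduled restart would fire at a preset episode whether or not the environment had changed.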
