Stochastic Resetting Accelerates Policy Convergence in Reinforcement Learning

Jello Zhou; Vudtiwat Ngampruetikorn; David J. Schwab

Abstract

Stochastic resetting, where a dynamical process is intermittently returned to a fixed reference state, has emerged as a powerful mechanism for optimizing first-passage properties. Existing theory largely treats static, non-learning processes. Here we ask how stochastic resetting interacts with reinforcement learning, where the underlying dynamics adapt through experience. In tabular grid environments, we find that resetting accelerates policy convergence even when it does not reduce the search time of a purely diffusive agent, indicating a novel mechanism beyond classical first-passage optimization. In a continuous control task with neural-network-based value approximation, we show that random resetting improves deep reinforcement learning when exploration is difficult and rewards are sparse. Unlike temporal discounting, resetting preserves the optimal policy while accelerating convergence by truncating long, uninformative trajectories to enhance value propagation. Our results establish stochastic resetting as a simple, tunable mechanism for accelerating learning, translating a canonical phenomenon of statistical mechanics into an optimization principle for reinforcement learning.

Paper Structure (14 sections, 2 equations, 17 figures)

Introduction
Results
Resetting accelerates policy convergence beyond search optimization
Resetting modifies training dynamics without changing the optimal policy
Resetting can accelerate DQN learning
Discussion
Materials and methods
Tabular grid environments
MountainCar environment
Stochastic resetting protocol
Dynamic programming baselines
Supplementary Information
First-passage time distributions
Supporting figures for learning dynamics

Figures (17)

Figure 1: Competing effects of resetting on search efficiency. Two trajectories (solid lines) begin at the same starting position. The first wanders away from the goal; resetting it to the starting position (dashed line) shortens the distance to the goal. The second has already moved closer to the goal; resetting it increases the distance from the goal and increases the search time. Whether resetting accelerates search on average depends on the balance between these two cases---a balance controlled by the reset rate, the environment and the underlying stochastic process.
Figure 2: Learning dynamics in GridWorld. (A) Median evaluation episode length during training for grid sizes $N=120$ and $N=60$, shown for reset rates $r=0$, $0.0015$, and $0.003$ (250 trials per condition). Plots are zoomed in to regions where learning dynamics are most prominent. (B) Separation of search and learning effects. Dashed curves (left axis): numerically computed median first-passage time (FPT) of a random walker, averaged over 2500 trials. Solid curves (right axis): median number of training steps until evaluation episode length converges to its optimum of 40 steps, shown for three exploration rates $\varepsilon$. (C) Evolution of the final contiguous path from the last reset to the goal over training episodes. (D) Heatmaps of median evaluation episode length as a function of reset rate and training steps (colour bar capped at 4000 steps). Full training, testing, and last-path curves are shown in Figs. \ref{fig:gridworld_extended_N60} and \ref{fig:gridworld_extended_N120}.
Figure 3: Learning dynamics in WindyCliff. (A) Median evaluation episode length at $\gamma=0.98$ for reset rates $r=0$, $0.0003$, and $0.003$ (250 trials per condition). Inset: extended view to $1.5\times10^6$ training steps. (B) Learning curves at $r=0$ for $\gamma=0.5$, $0.6$, and $0.98$. (C) Heatmap of median evaluation episode length over reset rates and training steps (capped at 1000). (D) Approach of evaluation episode length toward dynamic-programming (DP) optimal path length $L^*(\gamma)$ at $\gamma=0.98$ for varying $r$. (E) Same approach for $r=0$ and varying $\gamma$. (F) DP-optimal paths from start to goal for three values of $\gamma$. These results are for grid width $300$, height $150$, $p_w=0.005$, $s_w=3$. Full training, testing (including cliff falls), and last-path curves are shown in Fig. \ref{fig:windycliff-extended}.
Figure 4: Stochastic resetting accelerates deep reinforcement learning. (A) Schematic of the MountainCar environment: an underpowered car starting at the valley bottom must build momentum to reach the goal (flag). The left boundary is extended to $-1.7$, creating a deep trap that makes unassisted goal discovery difficult. (B) Fraction of replicates achieving evaluation performance $\leq 200$ steps as a function of cumulative training steps for a range of reset rates (see legend). Intermediate reset rates accelerate learning relative to the no-resetting baseline. (C) Median evaluation steps to goal. (D) Median cumulative number of goals reached during training. Intermediate reset rates increase the rate at which the agent encounters the goal, but excessively high rates are counterproductive. In all panels, increasing the reset rate from zero initially improves performance; beyond an intermediate optimum, further increases degrade it. The results are for the sparse reward scheme (i.e., reward is $+1$ upon reaching the goal and zero otherwise), with 512 replicates for each reset rate.
Figure 5: Schematic of the tabular GridWorld and WindyCliff environments. In both settings the agent navigates a grid, learning via Q-learning to find the shortest path from the start to the goal. At every training step the agent resets to the start with probability $r$ and otherwise follows the standard $\varepsilon$-greedy policy. In GridWorld the agent receives reward $+1$ at the goal and zero otherwise. In WindyCliff the goal reward is $+10$ and cliff falls incur a $-100$ penalty; optional stochastic wind blows the agent downward $s_w$ steps with probability $p_w$.
...and 12 more figures
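
The training protocol described in the Figure 5 caption (reset to the start with probability $r$ at every training step, otherwise take an $\varepsilon$-greedy Q-learning step, with reward $+1$ at the goal and zero otherwise) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the grid layout (start and goal in opposite corners), the hyperparameter values, and the choice to perform no Q update on a reset step are all assumptions made for illustration.

```python
import random

# Four grid moves: right, left, up, down (hypothetical layout).
ACTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def step(state, a, n):
    """Move on an n x n grid; moves into a wall leave the agent in place."""
    dx, dy = ACTIONS[a]
    x = min(max(state[0] + dx, 0), n - 1)
    y = min(max(state[1] + dy, 0), n - 1)
    return (x, y)


def q_learning_with_resetting(n=5, r=0.01, eps=0.1, alpha=0.5, gamma=0.9,
                              n_steps=50000, seed=0):
    """Tabular Q-learning with stochastic resetting (illustrative sketch)."""
    rng = random.Random(seed)
    start, goal = (0, 0), (n - 1, n - 1)
    Q = {((x, y), a): 0.0
         for x in range(n) for y in range(n) for a in range(4)}
    state = start
    for _ in range(n_steps):
        # Stochastic reset: with probability r, return to the start.
        # Assumption: a reset consumes the step and triggers no Q update.
        if rng.random() < r:
            state = start
            continue
        # Epsilon-greedy action selection with random tie-breaking.
        if rng.random() < eps:
            a = rng.randrange(4)
        else:
            best = max(Q[(state, b)] for b in range(4))
            a = rng.choice([b for b in range(4) if Q[(state, b)] == best])
        nxt = step(state, a, n)
        # Sparse reward: +1 on reaching the goal (terminal), zero otherwise.
        if nxt == goal:
            target = 1.0
        else:
            target = gamma * max(Q[(nxt, b)] for b in range(4))
        Q[(state, a)] += alpha * (target - Q[(state, a)])
        # A new episode begins at the start once the goal is reached.
        state = start if nxt == goal else nxt
    return Q


def greedy_episode_length(Q, n, max_steps=1000):
    """Length of a greedy (evaluation) episode; max_steps if goal not reached."""
    state, goal = (0, 0), (n - 1, n - 1)
    for t in range(1, max_steps + 1):
        a = max(range(4), key=lambda b: Q[(state, b)])
        state = step(state, a, n)
        if state == goal:
            return t
    return max_steps


if __name__ == "__main__":
    for rate in (0.0, 0.01):
        Q = q_learning_with_resetting(r=rate)
        print(rate, greedy_episode_length(Q, 5))
```

Calling `greedy_episode_length` at intervals during training, for several values of `r`, would give learning curves loosely analogous to the evaluation episode lengths in Figure 2, although this toy grid is far smaller than the $N=60$ and $N=120$ environments reported there.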
