Is Prior-Free Black-Box Non-Stationary Reinforcement Learning Feasible?

Argyrios Gerogiannis, Yu-Han Huang, Venugopal V. Veeravalli

TL;DR

This work examines the practicality of prior-free black-box NS-RL through the MASTER framework. It shows that, for realistic horizons, MASTER’s non-stationarity detection is unlikely to trigger, so its performance resembles that of a random-restart strategy with memory. It also shows that MASTER’s regret bound, although order optimal, stays above the trivial worst-case linear regret until the horizon becomes unreasonably large. In experiments on piecewise stationary MABs, methods based on quickest change detection (QCD) are more robust and consistently outperform both MASTER and naive random restarting. The results expose a gap between order-optimal theory and practical feasibility, and motivate principled QCD-based detectors for NS-RL that extend beyond multi-armed bandits. Overall, the paper argues that effective change detection is key to viable NS-RL and suggests that future work integrate rigorous QCD-based mechanisms into black-box RL systems.

Abstract

We study the problem of Non-Stationary Reinforcement Learning (NS-RL) without prior knowledge about the system's non-stationarity. A state-of-the-art, black-box algorithm, known as MASTER, is considered, with a focus on identifying the conditions under which it can achieve its stated goals. Specifically, we prove that MASTER's non-stationarity detection mechanism is not triggered for practical choices of horizon, leading to performance akin to a random restarting algorithm. Moreover, we show that the regret bound for MASTER, while being order optimal, stays above the worst-case linear regret until unreasonably large values of the horizon. To validate these observations, MASTER is tested for the special case of piecewise stationary multi-armed bandits, along with methods that employ random restarting, and others that use quickest change detection to restart. A simple, order-optimal random restarting algorithm that has prior knowledge of the non-stationarity is proposed as a baseline. The behavior of the MASTER algorithm is validated in simulations, and it is shown that methods employing quickest change detection are more robust and consistently outperform MASTER and other random restarting approaches.
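
To make the experimental setting concrete, the following is a minimal sketch of a random-restarting bandit learner on a piecewise stationary Bernoulli MAB, with dynamic regret measured against the best arm of the current segment. It is an illustration only, not the paper's baseline or MASTER: the UCB1 base learner, the function name `run_restarting_ucb`, the fixed per-step restart probability `restart_prob`, and the equal-length segments are all assumptions introduced here.

```python
import math
import random


def run_restarting_ucb(means_per_segment, segment_len, restart_prob, seed=0):
    """UCB1 with random restarts on a piecewise stationary Bernoulli bandit.

    means_per_segment: one list of arm means per stationary segment
                       (hypothetical test instance, not from the paper).
    segment_len:       number of time steps in each segment (assumed equal).
    restart_prob:      probability of discarding all statistics at each step.
    Returns the cumulative dynamic (pseudo-)regret.
    """
    rng = random.Random(seed)
    n_arms = len(means_per_segment[0])
    counts = [0] * n_arms      # pulls per arm since the last restart
    sums = [0.0] * n_arms      # reward totals per arm since the last restart
    local_t = 0                # steps since the last restart
    regret = 0.0

    for means in means_per_segment:
        best = max(means)      # best arm mean of the current segment
        for _ in range(segment_len):
            # Random restart: forget everything learned so far.
            if rng.random() < restart_prob:
                counts = [0] * n_arms
                sums = [0.0] * n_arms
                local_t = 0
            local_t += 1

            if 0 in counts:
                # Pull every arm once before using confidence bounds.
                arm = counts.index(0)
            else:
                arm = max(
                    range(n_arms),
                    key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2.0 * math.log(local_t) / counts[a]),
                )

            reward = 1.0 if rng.random() < means[arm] else 0.0
            counts[arm] += 1
            sums[arm] += reward
            # Dynamic regret compares against the current segment's best arm.
            regret += best - means[arm]
    return regret


if __name__ == "__main__":
    instance = [[0.9, 0.1], [0.1, 0.9]]  # the best arm swaps at the change-point
    for p in (0.0, 0.01):
        print(p, run_restarting_ucb(instance, 500, p))
```

Setting `restart_prob=0.0` gives plain UCB1 with no restarts. A QCD-based method would replace the random restart condition with a change detector run on the observed rewards.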

Paper Structure

This paper contains 36 sections, 12 theorems, 55 equations, 15 figures, 20 tables, 3 algorithms.

Key Result

Lemma 1

(Adapted from wei2021non.) Let $\hat{n} = \log_2 T + 1$, $\hat{\rho}(t) = 6\hat{n} \log(T/\delta)\rho(t)$ and $t' = t - alg.s + 1$. Suppose Assumption 1 holds for ALG, and that $n \leq \log_2 T$. Then, for any instance $alg$, with start at $alg.s$ and finish at $alg.e$, that is maintain
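
As a rough numerical illustration of the quantities defined in Lemma 1, the sketch below evaluates the multiplier $6\hat{n}\log(T/\delta)$ that turns $\rho(t)$ into $\hat{\rho}(t)$. Several things here are assumptions rather than statements from the paper: $\log$ is taken to be the natural logarithm, rewards are taken to lie in $[0, 1]$ (so a threshold above 1 is uninformative), and `example_rho` with $\rho(t) = 1/\sqrt{t}$ is a hypothetical stand-in for the base algorithm's rate. The function names are also invented for this example.

```python
import math


def master_threshold_scale(T, delta):
    """Multiplier 6 * n_hat * log(T / delta), with n_hat = log2(T) + 1."""
    n_hat = math.log2(T) + 1
    return 6.0 * n_hat * math.log(T / delta)  # natural log is an assumption


def rho_hat(t, T, delta, rho):
    """rho_hat(t) = 6 * n_hat * log(T / delta) * rho(t), as defined in Lemma 1."""
    return master_threshold_scale(T, delta) * rho(t)


def example_rho(t):
    """Hypothetical base rate rho(t) = 1 / sqrt(t); not taken from the paper."""
    return 1.0 / math.sqrt(t)


if __name__ == "__main__":
    T = 100_000      # horizon used in the paper's figures
    delta = 0.01     # hypothetical confidence parameter
    scale = master_threshold_scale(T, delta)
    # With rho(t) = 1/sqrt(t), rho_hat(t) drops below 1 only once t > scale**2.
    first_informative_t = math.ceil(scale ** 2)
    print("multiplier:", scale)
    print("rho_hat(T):", rho_hat(T, T, delta, example_rho))
    print("first t with rho_hat(t) < 1:", first_informative_t)
```

Under these assumptions the multiplier is on the order of $10^3$ for $T = 100000$, so $\hat{\rho}(t)$ stays above 1 over the whole horizon. This is in line with the paper's claim that detection is not triggered for practical horizons.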

Figures (15)

  • Figure 1: Dynamic regret plots versus the time steps for $T=100000$, averaged over $4000$ independent runs. The case of the geometric change-points is on the top row for $\xi=0.4,0.6,0.8$ and the case of the deterministic change-points on the bottom row for $N_C=1000,101,10$. Left: Uniform problem. Right: Worst-case problem.
  • Figure 2: Final dynamic regret versus the decrease in non-stationarity, for $T=100000$, $4000$ runs.
  • Figure 3: Final dynamic regret versus the decrease in non-stationarity.
  • Figure 5: Final dynamic regret versus the decrease in non-stationarity.
  • Figure 7: Final dynamic regret versus the decrease in non-stationarity.
  • ...and 10 more figures

Theorems & Definitions (19)

  • Definition 1
  • Definition 2
  • Lemma 1
  • Theorem 1
  • Lemma 2
  • Remark 1
  • Corollary 1
  • Corollary 2
  • Definition 3
  • Definition 4
  • ...and 9 more