Posterior Sampling for Continuing Environments
Wanqiao Xu, Shi Dong, Benjamin Van Roy
TL;DR
The paper introduces Continuing PSRL, a posterior-sampling reinforcement learning method tailored for continuing environments, where the agent resamples a new environment model with probability $1-\gamma$ and plans using a $\gamma$-discounted objective. It proves a sublinear Bayesian regret bound $\tilde{O}(\tau S \sqrt{A T})$ and shows how a time-varying discount schedule $\gamma_t$ yields sublinear regret with respect to the optimal average reward, without requiring episodic resets. The analysis leverages a value-decomposition lemma and confidence sets to bound Bellman-error terms, achieving results comparable to previous TSDE-type approaches but with a simpler, scalable resampling mechanism. Empirical results on tabular and continuous RiverSwim variants demonstrate competitive performance and illustrate the practicality of resampling-based exploration in large or non-resetting environments, including fixes for function-approximation settings like bootstrapped DQN. Overall, the work clarifies the role of discounting in continuing RL and provides a scalable, theoretically-grounded exploration strategy for complex environments.
Abstract
We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.
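The resampling loop described in the abstract can be illustrated with a minimal tabular sketch. It is not the authors' implementation: the Dirichlet(1, ..., 1) prior over transitions, the assumption that mean rewards are known, the toy two-state environment, and all function names (`plan_discounted`, `sample_model`, `continuing_psrl`) are hypothetical choices made for illustration. Because resampling happens independently at each step with probability $1-\gamma$, the number of steps between resamples is geometric with mean $1/(1-\gamma)$.

```python
import numpy as np


def plan_discounted(P, R, gamma, iters=200):
    """Value iteration for the gamma-discounted objective; returns a greedy policy.

    P has shape (S, A, S) (transition probabilities), R has shape (S, A).
    """
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P @ V)  # (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)


def sample_model(counts, rng):
    """Draw one transition model from independent Dirichlet posteriors."""
    S, A, _ = counts.shape
    P = np.empty((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(counts[s, a])
    return P


def continuing_psrl(P_true, R, gamma, T, seed=0):
    """Run the resampling loop for T steps with no episodic resets."""
    rng = np.random.default_rng(seed)
    S, A, _ = P_true.shape
    counts = np.ones((S, A, S))  # assumed Dirichlet(1, ..., 1) prior pseudo-counts
    policy = plan_discounted(sample_model(counts, rng), R, gamma)
    s, total_reward, resamples = 0, 0.0, 0
    for _ in range(T):
        # With probability 1 - gamma, replace the model by a fresh posterior sample.
        if rng.random() < 1.0 - gamma:
            policy = plan_discounted(sample_model(counts, rng), R, gamma)
            resamples += 1
        a = policy[s]
        s_next = rng.choice(S, p=P_true[s, a])
        total_reward += R[s, a]
        counts[s, a, s_next] += 1  # posterior update from the observed transition
        s = s_next
    return total_reward / T, resamples, counts


# Toy environment (hypothetical): action 1 tends to move toward state 1,
# which is the only state that pays reward.
P_true = np.array([[[0.9, 0.1], [0.2, 0.8]],
                   [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])
T = 2000
avg_reward, resamples, counts = continuing_psrl(P_true, R, gamma=0.95, T=T)
print(f"average reward: {avg_reward:.3f}, resamples: {resamples}")
```

In this sketch the agent replans only when a new model is drawn, so a larger $\gamma$ means both a longer planning horizon and less frequent resampling. The sketch keeps $\gamma$ fixed, whereas the regret bound in the abstract relies on choosing the discount factor as a function of the horizon $T$.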
