Q-learning with Posterior Sampling
Priyank Agrawal, Shipra Agrawal, Azmat Azati
TL;DR
The paper tackles the exploration-exploitation trade-off in online, episodic, tabular RL by introducing Q-learning with Posterior Sampling (PSQL), which maintains Gaussian posteriors over Q-values and selects actions via Thompson-style sampling. A key innovation is to compute optimistic multi-sample targets for the next-step value while choosing actions from a single sample, which makes the regret analysis tractable. The authors derive a Bayesian interpretation and an ELBO-based learning rule, and establish a regret bound of $\tilde{O}(H^2\sqrt{SAT})$ (with $T=KH$), which matches the known lower bound $\Omega(H\sqrt{SAT})$ up to a factor of $H$ and logarithmic terms. The analysis also shows how to control the recursive error propagation caused by bootstrapping. The results indicate that a simple, model-free posterior sampling method can achieve strong theoretical guarantees while remaining computationally efficient, providing a starting point for extending posterior-sampling analyses to more complex RL settings.
Abstract
Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde{O}(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, $S$ and $A$ denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T=KH$ with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.
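To make the mechanism concrete, the following is a minimal Python sketch of tabular Q-learning with Gaussian posterior sampling: one posterior draw per action to choose the action, and several draws to form an optimistic next-step target. It is an illustration under stated assumptions, not the authors' exact algorithm. The function name `psql_sketch`, the posterior standard deviation `noise_scale * H / sqrt(n + 1)`, the learning rate `(H + 1) / (H + n)`, the optimistic initial mean `H`, the fixed start state, and the use of a maximum over `n_target_samples` draws are all choices made here for illustration; the paper's variance schedule, step size, and target construction may differ.

```python
import numpy as np


def psql_sketch(P, R, H, K, n_target_samples=4, noise_scale=1.0, seed=0):
    """Illustrative tabular Q-learning with Gaussian posterior sampling.

    P: (S, A, S) array of transition probabilities.
    R: (S, A) array of deterministic rewards in [0, 1].
    H: horizon, K: number of episodes.
    All hyperparameters below are assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    S, A = R.shape
    q_mean = np.full((H, S, A), float(H))      # assumed optimistic initial means
    counts = np.zeros((H, S, A), dtype=int)    # visit counts n_h(s, a)
    returns = np.zeros(K)

    def posterior_std(h, s):
        # Assumed schedule: uncertainty shrinks as 1 / sqrt(visits + 1).
        return noise_scale * H / np.sqrt(counts[h, s] + 1.0)

    for k in range(K):
        s = 0  # assumed fixed initial state
        for h in range(H):
            # Action choice: a single Gaussian draw per action, then argmax.
            q_draw = rng.normal(q_mean[h, s], posterior_std(h, s))
            a = int(np.argmax(q_draw))

            r = R[s, a]
            s_next = int(rng.choice(S, p=P[s, a]))
            returns[k] += r

            # Target: max over several draws and all actions at the next
            # step, which biases the next-step value estimate upward.
            if h + 1 < H:
                draws = rng.normal(
                    q_mean[h + 1, s_next],
                    posterior_std(h + 1, s_next),
                    size=(n_target_samples, A),
                )
                v_next = float(draws.max())
            else:
                v_next = 0.0

            # Q-learning update of the posterior mean.
            counts[h, s, a] += 1
            n = counts[h, s, a]
            lr = (H + 1) / (H + n)
            q_mean[h, s, a] = (1 - lr) * q_mean[h, s, a] + lr * (r + v_next)
            s = s_next

    return q_mean, counts, returns
```

Separating the two sampling steps mirrors the summary above: the action is driven by a single draw, as in Thompson Sampling, while the bootstrapped target uses several draws so that it is optimistic with higher probability.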
