Q-learning with Posterior Sampling

Priyank Agrawal; Shipra Agrawal; Azmat Azati

TL;DR

The paper tackles the exploration-exploitation trade-off in online, episodic, tabular RL by introducing Q-learning with Posterior Sampling (PSQL), which maintains Gaussian posteriors over Q-values and selects actions via Thompson-style sampling. A key innovation is computing optimistic multi-sample targets for the next-step value while using a single-sample action choice, which makes the regret analysis tractable. The authors derive a Bayesian interpretation and an ELBO-based learning rule, establish a near-optimal regret bound of $\tilde{O}(H^2\sqrt{SAT})$ (with $T=KH$) that is within a factor of $H$, up to logarithmic terms, of the known lower bound $\Omega(H\sqrt{SAT})$, and show how to manage the recursive error propagation caused by bootstrapping. The results show that simple, model-free posterior sampling can achieve strong theoretical guarantees while remaining practically efficient, providing a starting point for extending posterior-sampling analyses to more complex RL settings.
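To make the split between single-sample action selection and multi-sample optimistic targets concrete, here is a minimal Python sketch of a tabular agent in this style. It is an illustration under stated assumptions, not the authors' algorithm: the class name `PSQLSketch`, the variance schedule, the learning rate $(H+1)/(H+n)$, the optimistic initialization at $H$, and the number of target samples are all hypothetical choices that this summary does not specify.

```python
import numpy as np


class PSQLSketch:
    """Illustrative tabular agent with Gaussian posteriors over Q-values.

    All hyperparameters and schedules here are assumptions for illustration.
    """

    def __init__(self, S, A, H, num_target_samples=5, sigma_scale=1.0, seed=0):
        self.S, self.A, self.H = S, A, H
        self.J = num_target_samples          # assumed number of target samples
        self.sigma_scale = sigma_scale       # assumed posterior width factor
        self.rng = np.random.default_rng(seed)
        # Posterior means start at H (assumed optimistic init); counts at zero.
        self.mean = np.full((H, S, A), float(H))
        self.count = np.zeros((H, S, A), dtype=int)

    def _std(self, h, s):
        # Assumed schedule: standard deviation shrinks as 1/sqrt(visit count).
        n = np.maximum(self.count[h, s], 1)
        return self.sigma_scale * self.H / np.sqrt(n)

    def act(self, h, s):
        # One posterior sample per action, then act greedily (Thompson-style).
        q = self.rng.normal(self.mean[h, s], self._std(h, s))
        return int(np.argmax(q))

    def target_value(self, h, s):
        # Optimistic next-step value: the maximum over J posterior samples
        # per action (one plausible reading of "multi-sample target").
        if h >= self.H:
            return 0.0  # no value beyond the horizon
        q = self.rng.normal(
            self.mean[h, s], self._std(h, s), size=(self.J, self.A)
        )
        return float(q.max())

    def update(self, h, s, a, r, s_next):
        # Q-learning style update of the posterior mean toward a
        # bootstrapped target built from the optimistic next-step value.
        self.count[h, s, a] += 1
        n = self.count[h, s, a]
        alpha = (self.H + 1) / (self.H + n)  # assumed learning rate
        target = r + self.target_value(h + 1, s_next)
        self.mean[h, s, a] = (1 - alpha) * self.mean[h, s, a] + alpha * target
```

In an episode loop, one would call `act(h, s)` at each step `h`, observe the reward and next state from the environment, and then call `update(h, s, a, r, s_next)`. Taking the maximum over several samples biases the target upward, which is the sense in which the sketch treats it as optimistic.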

Abstract

Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde{O}(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, $S$ and $A$ denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T=KH$ with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.
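As a quick check on how close the two bounds are, the following worked arithmetic uses only the quantities defined in the abstract. Ignoring the logarithmic factors hidden by $\tilde{O}$, the ratio of the upper bound to the lower bound is

$$\frac{H^2\sqrt{SAT}}{H\sqrt{SAT}} = H,$$

so the guarantee is within a factor of $H$ of the lower bound. Substituting $T = KH$ expresses the same upper bound in terms of the number of episodes:

$$H^2\sqrt{SAT} = H^2\sqrt{SAKH} = H^{5/2}\sqrt{SAK}.$$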
