Q-learning with Posterior Sampling

Priyank Agrawal, Shipra Agrawal, Azmat Azati

TL;DR

The paper addresses the exploration-exploitation trade-off in online, episodic, tabular RL by introducing Q-learning with Posterior Sampling (PSQL), which maintains Gaussian posteriors over Q-values and selects actions by Thompson-style sampling. A key innovation is to compute the next-step value target optimistically from multiple posterior samples while choosing each action from a single sample, which makes the regret analysis tractable. The authors derive a Bayesian interpretation and an ELBO-based learning rule, and establish a regret bound of $\tilde{O}(H^2\sqrt{SAT})$ (with $T=KH$), which is within a factor of $H$, up to logarithmic terms, of the known lower bound $\Omega(H\sqrt{SAT})$. They also show how to control the recursive error propagation caused by bootstrapping. The results indicate that a simple, model-free posterior sampling method can achieve strong theoretical guarantees while remaining computationally efficient, providing a starting point for extending posterior-sampling analyses to more complex RL settings.

Abstract

Bayesian posterior sampling techniques have demonstrated superior empirical performance in many exploration-exploitation settings. However, their theoretical analysis remains a challenge, especially in complex settings like reinforcement learning. In this paper, we introduce Q-Learning with Posterior Sampling (PSQL), a simple Q-learning-based algorithm that uses Gaussian posteriors on Q-values for exploration, akin to the popular Thompson Sampling algorithm in the multi-armed bandit setting. We show that in the tabular episodic MDP setting, PSQL achieves a regret bound of $\tilde{O}(H^2\sqrt{SAT})$, closely matching the known lower bound of $\Omega(H\sqrt{SAT})$. Here, $S$ and $A$ denote the number of states and actions in the underlying Markov Decision Process (MDP), and $T=KH$, with $K$ being the number of episodes and $H$ being the planning horizon. Our work provides several new technical insights into the core challenges in combining posterior sampling with dynamic programming and TD-learning-based RL algorithms, along with novel ideas for resolving those difficulties. We hope this will form a starting point for analyzing this efficient and important algorithmic technique in even more complex RL settings.
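To make the mechanism described above concrete, the following is a minimal sketch of Q-learning with Gaussian posterior sampling in a tabular episodic MDP. It is not the authors' algorithm as specified in the paper: the variance schedule (`posterior_std`), the learning rate $(H+1)/(H+n)$ (the one commonly used in Q-learning regret analyses), the exact form of the multi-sample target, the clipping to $[0, H]$, and the helper `make_random_mdp` are all assumptions introduced for illustration.

```python
import numpy as np


def make_random_mdp(S, A, seed=0):
    """Hypothetical helper: random transitions P[s, a, s'] and rewards R[s, a] in [0, 1]."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(S), size=(S, A))  # shape (S, A, S), rows sum to 1
    R = rng.uniform(0.0, 1.0, size=(S, A))
    return P, R


def posterior_std(H, counts):
    """ASSUMED variance schedule: std shrinks as H / sqrt(visit count)."""
    return H / np.sqrt(np.maximum(counts, 1))


def psql_sketch(P, R, H, K, n_target_samples=5, seed=0):
    """Illustrative posterior-sampling Q-learning loop (not the paper's exact PSQL)."""
    rng = np.random.default_rng(seed)
    S, A = R.shape
    mean = np.full((H + 1, S, A), float(H))  # posterior means, optimistic start
    mean[H] = 0.0                            # value after the last step is zero
    count = np.zeros((H + 1, S, A), dtype=int)
    returns = []
    for _ in range(K):
        s, total = 0, 0.0                    # every episode starts in state 0
        for h in range(H):
            # Action choice: a single Gaussian sample per action (Thompson-style).
            q_sample = rng.normal(mean[h, s], posterior_std(H, count[h, s]))
            a = int(np.argmax(q_sample))
            r = float(R[s, a])
            s_next = int(rng.choice(S, p=P[s, a]))
            total += r
            # Target: maximum over several posterior samples (ASSUMED form),
            # which biases the bootstrapped next-step value toward optimism.
            if h + 1 < H:
                draws = rng.normal(
                    mean[h + 1, s_next],
                    posterior_std(H, count[h + 1, s_next]),
                    size=(n_target_samples, A),
                )
                v_next = float(np.clip(draws.max(), 0.0, H))
            else:
                v_next = 0.0
            # Q-learning update of the posterior mean with an ASSUMED step size.
            count[h, s, a] += 1
            alpha = (H + 1) / (H + count[h, s, a])
            mean[h, s, a] = (1 - alpha) * mean[h, s, a] + alpha * (r + v_next)
            s = s_next
        returns.append(total)
    return mean, returns
```

A call such as `psql_sketch(*make_random_mdp(4, 2), H=3, K=200)` returns the learned posterior means and the per-episode returns. Note the asymmetry the TL;DR emphasizes: one sample drives the action choice, while several samples feed the bootstrapped target.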

Paper Structure

This paper contains 31 sections, 38 theorems, 121 equations, 3 figures, 1 table, 3 algorithms.

Key Result

Theorem 1

The cumulative regret of our PSQL algorithm (the main episodic algorithm) over $K$ episodes with horizon $H$ is bounded as $\text{Reg}(K) \leq \tilde{O}\left(H^2\sqrt{SAT}\right)$, where $T=KH$.
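As a quick check on how this bound relates to the lower bound quoted in the abstract, the short calculation below substitutes $T=KH$ and compares the two rates. It uses only quantities stated above and ignores the logarithmic factors hidden by $\tilde{O}$.

```latex
\[
H^2\sqrt{SAT} \;=\; H^2\sqrt{SAKH} \;=\; \sqrt{H^5 SAK},
\qquad
\frac{H^2\sqrt{SAT}}{H\sqrt{SAT}} \;=\; H .
\]
\[
\frac{\text{Reg}(K)}{K} \;\le\; \tilde{O}\!\left(\sqrt{\frac{H^5 SA}{K}}\right) \;\longrightarrow\; 0
\quad \text{as } K \to \infty .
\]
```

The upper bound therefore exceeds the lower bound $\Omega(H\sqrt{SAT})$ by a factor of $H$ (up to logarithmic terms), and the average per-episode regret vanishes as the number of episodes grows.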

Figures (3)

  • Figure 1: Performance comparison of PSQL$^*$ (a heuristic derived from PSQL), UCBQL [jin2018q], Staged-RandQL [tiapkin2023model], and RLSVI [russo2019worst] in a chain MDP environment (for details and more experiments, see the experiments appendix).
  • Figure 2: Regret comparison: the x-axis denotes the episode index, the y-axis denotes cumulative regret.
  • Figure 3: Regret comparison: the x-axis denotes the episode index, the y-axis denotes cumulative regret.

Theorems & Definitions (65)

  • Theorem 1: Informal
  • Theorem 2
  • Lemma 1: Abridged
  • Lemma 2: Abridged
  • Lemma 3: Optimism error
  • Lemma 4: Posterior mean estimation error
  • Lemma 5: Cumulative estimation error
  • Proposition B.1: Also in [khan2023bayesian; knoblauch2022optimization]
  • proof
  • Lemma B.1
  • ...and 55 more