More Efficient Randomized Exploration for Reinforcement Learning via Approximate Sampling

Haque Ishfaq, Yixin Tan, Yu Yang, Qingfeng Lan, Jianfeng Lu, A. Rupam Mahmood, Doina Precup, Pan Xu

TL;DR

This work addresses the gap between theory and practice in Thompson Sampling for reinforcement learning by introducing a flexible FGTS-based framework that incorporates approximate samplers (e.g., Langevin Monte Carlo and Underdamped LMC). It provides a general regret analysis for FGTS with approximation errors, and specializes to linear MDPs to achieve the best-known dimension-dependent regret $\tilde{O}(d H^{3/2} \sqrt{T})$, while explicitly characterizing the sampling complexity for each sampler. Empirically, FGTS with approximate samplers demonstrates strong deep exploration performance on challenging tasks such as N-chain and Atari games, matching or surpassing strong baselines. Overall, the paper bridges theory and practice by delivering practical, scalable exploration methods with provable guarantees and broad applicability to both linear and nonlinear RL settings.

Abstract

Thompson sampling (TS) is one of the most popular exploration techniques in reinforcement learning (RL). However, most TS algorithms with theoretical guarantees are difficult to implement and not generalizable to Deep RL. While the emerging approximate sampling-based exploration schemes are promising, most existing algorithms are specific to linear Markov Decision Processes (MDP) with suboptimal regret bounds, or only use the most basic samplers such as Langevin Monte Carlo. In this work, we propose an algorithmic framework that incorporates different approximate sampling methods with the recently proposed Feel-Good Thompson Sampling (FGTS) approach (Zhang, 2022; Dann et al., 2021), which was previously known to be computationally intractable in general. When applied to linear MDPs, our regret analysis yields the best known dependency of regret on dimensionality, surpassing existing randomized algorithms. Additionally, we provide explicit sampling complexity for each employed sampler. Empirically, we show that in tasks where deep exploration is necessary, our proposed algorithms that combine FGTS and approximate sampling perform significantly better compared to other strong baselines. On several challenging games from the Atari 57 suite, our algorithms achieve performance that is either better than or on par with other strong baselines from the deep RL literature.
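As a rough illustration of what an approximate sampler such as Langevin Monte Carlo does, the sketch below runs the generic unadjusted Langevin update on a toy Gaussian target. It is not the paper's FGTS algorithm: the target density, the step size, the number of steps, and all names (`lmc_sample`, `grad_log_gaussian`, `mu`) are assumptions chosen only for illustration.

```python
import numpy as np


def lmc_sample(grad_log_density, theta0, step_size, n_steps, rng):
    """Unadjusted Langevin Monte Carlo (illustrative sketch).

    Repeats the update
        theta <- theta + h * grad log p(theta) + sqrt(2 h) * noise
    where h is the step size and noise is standard Gaussian.
    theta0 may hold several chains, one per row.
    """
    theta = np.array(theta0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(theta.shape)
        theta = (
            theta
            + step_size * grad_log_density(theta)
            + np.sqrt(2.0 * step_size) * noise
        )
    return theta


# Hypothetical toy target: a Gaussian N(mu, I), standing in for a posterior.
mu = np.array([1.0, -2.0])


def grad_log_gaussian(theta):
    # Gradient of log N(theta; mu, I) with respect to theta.
    return -(theta - mu)


rng = np.random.default_rng(0)
# Run 500 independent chains in parallel, all started at the origin.
samples = lmc_sample(grad_log_gaussian, np.zeros((500, 2)), 0.05, 200, rng)
sample_mean = samples.mean(axis=0)
print(sample_mean)
```

With a small step size the final iterates are approximately distributed as the target, so their empirical mean should land near `mu`. In a Thompson sampling setting the same update would presumably be driven by the gradient of a log-posterior over model parameters rather than by this toy Gaussian.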

Paper Structure (30 sections, 17 theorems, 63 equations, 4 figures, 4 tables, 1 algorithm)

Key Result

Theorem 4.6

Under the realizability, boundedness, and completeness assumptions, if $\eta\leq 2/5b^2$, then a bound on the expected regret holds, where the expectation is taken over the inherent randomness of the MDP, through the samples drawn from it, and over the algorithm's own randomness. If we further set $\eta = 1/4b^2$, $\lambda = \sqrt{K\kappa(b/K^2)/b^2\mathcal{K}_{\text{DC}}}$ and assume $\lambda b^2 \geq 1$ and

Figures (4)

  • Figure 1: A comparison of different methods in $N$-chain with different chain lengths $N$. As $N$ increases, exploration becomes harder. All results are averaged over $20$ runs, and the shaded areas represent $95\%$ confidence intervals.
  • Figure 2: The return curves of various algorithms in eight hard Atari tasks over 50 million training frames. Solid lines correspond to the median performance over 5 random seeds, and the shaded areas correspond to $95\%$ confidence intervals.
  • Figure 3: (a) A comparison of FG-ULMCDQN with different values of the weight $\eta$ for the feel-good prior term in Alien. Solid lines correspond to the average performance over 5 random seeds, and shaded areas correspond to $95\%$ confidence intervals. The performance of FG-ULMCDQN is not very sensitive to the value of $\eta$ as long as it is not very large. (b) A comparison of ULMCDQN with different values of the friction coefficient $\gamma$ in Alien.
  • Figure 4: $N$-chain environment (Osband et al., 2016).

Theorems & Definitions (30)

  • Definition 4.4
  • Definition 4.5
  • Theorem 4.6
  • Remark 4.7
  • Remark 4.8
  • Remark 4.9
  • Remark 4.10
  • Proposition 4.11
  • Remark 4.12
  • Remark 4.13
  • ...and 20 more