Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

Haque Ishfaq, Qingfeng Lan, Pan Xu, A. Rupam Mahmood, Doina Precup, Anima Anandkumar, Kamyar Azizzadenesheli

TL;DR

A scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL) by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo method, to directly sample the Q function from its posterior distribution.

Abstract

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2}H^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL by using the Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
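The "noisy gradient descent" update mentioned in the abstract can be illustrated with a minimal sketch of a generic Langevin Monte Carlo step, $w \leftarrow w - \eta \nabla L(w) + \sqrt{2\eta/\beta}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, whose iterates approximately follow a density proportional to $\exp(-\beta L(w))$. The function name `langevin_monte_carlo`, the toy loss $L(w) = (w-3)^2$, and all hyperparameter values are assumptions chosen for illustration; this is not the paper's algorithm or its deep RL implementation.

```python
import numpy as np

def langevin_monte_carlo(grad_loss, w0, eta, beta, num_steps, rng):
    """Run num_steps noisy gradient descent (Langevin) updates.

    Hypothetical helper for illustration: grad_loss maps w to the gradient
    of the loss, eta is the step size, beta the inverse temperature.
    """
    w = np.array(w0, dtype=float)
    for _ in range(num_steps):
        noise = rng.normal(size=w.shape)
        # Gradient step plus Gaussian noise scaled by sqrt(2 * eta / beta).
        w = w - eta * grad_loss(w) + np.sqrt(2.0 * eta / beta) * noise
    return w

# Toy target (an assumption): L(w) = (w - 3)^2, so exp(-beta * L) is
# Gaussian with mean 3 and variance 1 / (2 * beta).
def grad_quadratic(w):
    return 2.0 * (w - 3.0)

rng = np.random.default_rng(0)
# Each entry of w0 is an independent chain, since the gradient is elementwise.
samples = langevin_monte_carlo(
    grad_quadratic, w0=np.zeros(5000), eta=0.01, beta=1.0,
    num_steps=1000, rng=rng,
)
print(samples.mean(), samples.var())
```

With a small step size the empirical mean and variance of the chains should land close to the target values 3 and 0.5; the discretization inflates the variance slightly, which is why the step size is kept small here.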

Paper Structure (37 sections, 28 theorems, 145 equations, 10 figures, 3 tables, 2 algorithms)


Key Result

Proposition 3.1

The parameter $w_h^{k,J_k}$ used in episode $k$ of the LMC-TS algorithm follows a Gaussian distribution $\mathcal{N}(\mu_h^{k,J_k}, \Sigma_h^{k,J_k})$, whose mean and covariance matrix are expressed in terms of the matrices $A_i = I - 2\eta_i \Lambda_h^i$ for $i \in [k]$.
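To see why a matrix of the form $A = I - 2\eta\Lambda$ appears, consider a simplified single-episode setting with fixed data, which is an assumption made here for illustration and not the proposition's full multi-episode statement. For a regularized least-squares loss $L(w) = \|\Phi w - y\|^2 + \lambda\|w\|^2$, the gradient is $2(\Lambda w - b)$ with $\Lambda = \Phi^\top\Phi + \lambda I$ and $b = \Phi^\top y$, so one Langevin update is the affine map $w \leftarrow A w + 2\eta b + \sqrt{2\eta/\beta}\,\epsilon$. A Gaussian iterate therefore stays Gaussian, and its mean and covariance can be propagated in closed form. The sketch below uses hypothetical names (`lmc_gaussian_moments`, `Phi`, `y`) and random data.

```python
import numpy as np

def lmc_gaussian_moments(Lambda, b, eta, beta, num_steps, mu0, Sigma0):
    """Propagate the mean and covariance of a Gaussian LMC iterate.

    Hypothetical helper: assumes the quadratic loss described above,
    so each update is w <- A w + 2*eta*b + sqrt(2*eta/beta) * eps.
    """
    d = Lambda.shape[0]
    A = np.eye(d) - 2.0 * eta * Lambda
    mu, Sigma = mu0.copy(), Sigma0.copy()
    for _ in range(num_steps):
        mu = A @ mu + 2.0 * eta * b
        Sigma = A @ Sigma @ A.T + (2.0 * eta / beta) * np.eye(d)
    return mu, Sigma

# Random regression data (illustrative assumption).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 3))
y = rng.normal(size=50)
lam, beta = 1.0, 2.0
Lambda = Phi.T @ Phi + lam * np.eye(3)
b = Phi.T @ y

# Step size below 1 / lambda_max(Lambda) keeps the eigenvalues of A in (0, 1).
eta = 1.0 / (4.0 * np.linalg.eigvalsh(Lambda).max())
mu, Sigma = lmc_gaussian_moments(
    Lambda, b, eta, beta, num_steps=2000,
    mu0=np.zeros(3), Sigma0=np.zeros((3, 3)),
)

# Fixed points of the recursion: the ridge solution and
# (1 / beta) * Lambda^{-1} (I + A)^{-1}.
w_hat = np.linalg.solve(Lambda, b)
A = np.eye(3) - 2.0 * eta * Lambda
Sigma_limit = np.linalg.inv(Lambda) @ np.linalg.inv(np.eye(3) + A) / beta
print(np.allclose(mu, w_hat), np.allclose(Sigma, Sigma_limit))
```

In this simplified setting the mean converges to the ridge regression solution $\Lambda^{-1} b$, and the covariance converges to the solution of $\Sigma = A \Sigma A + (2\eta/\beta) I$.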

Figures (10)

  • Figure 1: A comparison of Adam LMCDQN and other baselines in $N$-chain with different chain lengths $N$. All results are averaged over $20$ runs and the shaded areas represent standard errors. As $N$ increases, the exploration hardness increases.
  • Figure 2: The return curves of various algorithms in eight Atari tasks over 50 million training frames. Solid lines correspond to the median performance over 5 random seeds, and the shaded areas correspond to the $90\%$ confidence interval.
  • Figure 3: (a) A comparison of Adam LMCDQN with different values of the bias factor $a$ in Qbert. Solid lines correspond to the average performance over 5 random seeds, and shaded areas correspond to standard errors. The performance of Adam LMCDQN is greatly affected by the value of the bias factor. (b) A comparison of Adam LMCDQN with different values of the inverse temperature parameter $\beta_k$ in Qbert. Adam LMCDQN is not very sensitive to the inverse temperature $\beta_k$.
  • Figure 4: Comparison of LMC-LSVI, OPPO (Cai et al., 2020), LSVI-UCB (Jin et al., 2019), and LSVI-PHE (Ishfaq et al., 2021) in randomly generated non-stationary linearly parameterized MDPs with 10 states, 4 actions, horizon length $H=100$ and a sparse transition matrix.
  • Figure 5: The 6-state RiverSwim environment from Osband et al. (2013). Here, state $s_1$ has a small reward while state $s_6$ has a large reward. The dotted arrows represent the action "left" and deterministically move the agent to the left. The continuous arrows denote the action "right" and move the agent to the right with a relatively high probability. This action represents swimming against the current, hence the name RiverSwim.
  • ...and 5 more figures

Theorems & Definitions (51)

  • Proposition 3.1
  • Definition 4.1: Linear MDP
  • Theorem 4.2
  • Remark 4.3
  • Remark 6.1
  • Proposition B.1
  • Definition B.2: Model prediction error
  • Lemma B.3
  • Lemma B.4
  • Lemma B.5
  • ...and 41 more