Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

Haque Ishfaq; Qingfeng Lan; Pan Xu; A. Rupam Mahmood; Doina Precup; Anima Anandkumar; Kamyar Azizzadenesheli

TL;DR

A scalable and effective Thompson sampling exploration strategy for reinforcement learning (RL) that uses Langevin Monte Carlo, an efficient Markov chain Monte Carlo (MCMC) method, to sample the Q function directly from its posterior distribution.

Abstract

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead sample the Q function directly from its posterior distribution using Langevin Monte Carlo, an efficient type of Markov chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of $\tilde{O}(d^{3/2}H^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature mapping, $H$ is the planning horizon, and $T$ is the total number of steps. We apply this approach to deep RL by using the Adam optimizer to perform gradient updates. Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
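The "noisy gradient descent" idea in the abstract can be illustrated with a minimal sketch. It assumes a ridge-regression loss $L(w) = \|\Phi w - y\|^2 + \|w\|^2$ as a stand-in for the Q-function loss. The names `lmc_sample`, `Phi`, `targets`, `eta` and `beta` are hypothetical and chosen for illustration; this is not the authors' implementation, only a plain Langevin Monte Carlo loop on a quadratic objective.

```python
import numpy as np

def lmc_sample(Lambda, b, w_init, eta, beta, n_steps, rng):
    """Run n_steps of Langevin Monte Carlo on L(w) = w^T Lambda w - 2 b^T w + const.

    Each step is a gradient descent update plus Gaussian noise scaled by
    sqrt(2 * eta / beta), where beta plays the role of an inverse temperature.
    """
    w = np.array(w_init, dtype=float)
    for _ in range(n_steps):
        grad = 2.0 * (Lambda @ w - b)              # gradient of the quadratic loss
        noise = rng.standard_normal(w.shape)       # fresh isotropic Gaussian noise
        w = w - eta * grad + np.sqrt(2.0 * eta / beta) * noise
    return w

# Synthetic regression problem (illustrative data, not from the paper).
rng = np.random.default_rng(0)
Phi = rng.standard_normal((50, 3))
targets = Phi @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)

Lambda = Phi.T @ Phi + np.eye(3)                   # regularized Gram matrix
b = Phi.T @ targets
w_hat = np.linalg.solve(Lambda, b)                 # ridge solution, the mode of exp(-beta * L)

# A step size below 1 / lambda_max keeps the iteration stable.
eta = 0.25 / np.linalg.eigvalsh(Lambda).max()

# Draw independent approximate posterior samples, one chain per sample.
samples = np.stack([
    lmc_sample(Lambda, b, np.zeros(3), eta, beta=1.0, n_steps=100, rng=rng)
    for _ in range(500)
])
print("sample mean:", samples.mean(axis=0))
print("ridge solution:", w_hat)
```

In this toy setting the samples scatter around the ridge solution, and each draw could serve as one Thompson-sampling estimate of the parameters. A smaller `beta` injects more noise and therefore spreads the samples more widely.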

Paper Structure (37 sections, 28 theorems, 145 equations, 10 figures, 3 tables, 2 algorithms)

Introduction
Preliminary
Langevin Monte Carlo for Reinforcement Learning
Theoretical Analysis
Deep Q-Network with LMC Exploration
Experiments
Demonstration of Deep Exploration
Evaluation in Atari Games
Conclusion and Future Work
Related Work
Proof of the Regret Bound of LMC-LSVI
Supporting Lemmas
Regret Analysis
Proof of Supporting Lemmas
Proof of Proposition 3.1
...and 22 more sections

Key Result

Proposition 3.1

The parameter $w_h^{k,J_k}$ used in episode $k$ of the LMC-LSVI algorithm follows a Gaussian distribution $\mathcal{N}(\mu_h^{k,J_k}, \Sigma_h^{k,J_k})$, whose mean and covariance matrix are expressed in terms of $A_i = I - 2\eta_i \Lambda_h^i$ for $i \in [k]$.
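A one-step calculation suggests where the matrix $A_k$ comes from. The sketch below assumes, beyond what this summary states, that the loss at step $h$ of episode $k$ is quadratic with gradient $\nabla L_h^k(w) = 2(\Lambda_h^k w - b_h^k)$, and that the update uses step size $\eta_k$ and inverse temperature $\beta_k$. The vector $b_h^k$ and the noise symbol $\epsilon_h^{k,j}$ are names introduced here for illustration.

```latex
\begin{aligned}
w_h^{k,j}
  &= w_h^{k,j-1} - \eta_k \nabla L_h^k\bigl(w_h^{k,j-1}\bigr)
     + \sqrt{2\eta_k \beta_k^{-1}}\,\epsilon_h^{k,j},
     \qquad \epsilon_h^{k,j} \sim \mathcal{N}(0, I) \\
  &= \bigl(I - 2\eta_k \Lambda_h^k\bigr)\, w_h^{k,j-1}
     + 2\eta_k b_h^k + \sqrt{2\eta_k \beta_k^{-1}}\,\epsilon_h^{k,j} \\
  &= A_k\, w_h^{k,j-1} + 2\eta_k b_h^k
     + \sqrt{2\eta_k \beta_k^{-1}}\,\epsilon_h^{k,j}.
\end{aligned}
```

Under these assumptions each update is an affine map of the previous iterate plus independent Gaussian noise, so a deterministic or Gaussian initial point yields Gaussian iterates after any number of steps, which is consistent with the form of the proposition.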

Figures (10)

Figure 1: A comparison of Adam LMCDQN and other baselines in $N$-chain with different chain lengths $N$. All results are averaged over $20$ runs and the shaded areas represent standard errors. As $N$ increases, the exploration hardness increases.
Figure 2: The return curves of various algorithms in eight Atari tasks over 50 million training frames. Solid lines correspond to the median performance over 5 random seeds, and the shaded areas correspond to the $90\%$ confidence interval.
Figure 3: (a) A comparison of Adam LMCDQN with different values of the bias factor $a$ in Qbert. Solid lines correspond to the average performance over 5 random seeds, and shaded areas correspond to standard errors. The performance of Adam LMCDQN is greatly affected by the value of the bias factor. (b) A comparison of Adam LMCDQN with different values of the inverse temperature parameter $\beta_k$ in Qbert. Adam LMCDQN is not very sensitive to the inverse temperature $\beta_k$.
Figure 4: Comparison of LMC-LSVI, OPPO [cai2020provably], LSVI-UCB [jin2019provably] and LSVI-PHE [ishfaq2021randomized] in randomly generated non-stationary linearly parameterized MDPs with 10 states, 4 actions, horizon length $H=100$ and a sparse transition matrix.
Figure 5: The 6-state RiverSwim environment from [osband2013more]. Here, state $s_1$ has a small reward while state $s_6$ has a large reward. The dotted arrows represent the action "left" and deterministically move the agent to the left. The continuous arrows denote the action "right" and move the agent to the right with a relatively high probability. This action represents swimming against the current, hence the name RiverSwim.
...and 5 more figures

Theorems & Definitions (51)

Proposition 3.1
Definition 4.1: Linear MDP
Theorem 4.2
Remark 4.3
Remark 6.1
Proposition B.1
Definition B.2: Model prediction error
Lemma B.3
Lemma B.4
Lemma B.5
...and 41 more
