Optimistically Optimistic Exploration for Provably Efficient Infinite-Horizon Reinforcement and Imitation Learning

Antoine Moulin, Gergely Neu, Luca Viano

Abstract

We study the problem of reinforcement learning in infinite-horizon discounted linear Markov decision processes (MDPs), and propose the first computationally efficient algorithm achieving rate-optimal regret guarantees in this setting. Our main idea is to combine two classic techniques for optimistic exploration: additive exploration bonuses applied to the reward function, and artificial transitions made to an absorbing state with maximal return. We show that, combined with a regularized approximate dynamic-programming scheme, the resulting algorithm achieves a regret of order $\tilde{\mathcal{O}}(\sqrt{d^3 (1 - \gamma)^{-7/2} T})$, where $T$ is the total number of sample transitions, $\gamma \in (0,1)$ is the discount factor, and $d$ is the feature dimensionality. The results continue to hold against adversarial reward sequences, enabling application of our method to the problem of imitation learning in linear MDPs, where we achieve state-of-the-art results.
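To make the two optimism mechanisms named in the abstract concrete, the following is a minimal sketch and not the paper's algorithm. It assumes rewards in $[0, 1]$ (so the maximal return is $1/(1-\gamma)$), uses the standard elliptical width $\sqrt{\phi^\top \Lambda^{-1} \phi}$ as the uncertainty measure, and replaces whatever rule the paper uses for the artificial transition with a hard threshold. The names `uncertainty_width`, `optimistic_target`, `beta`, and `threshold` are hypothetical.

```python
import numpy as np


def uncertainty_width(phi, Lambda_inv):
    """Elliptical width sqrt(phi^T Lambda^{-1} phi) of a feature vector."""
    return float(np.sqrt(phi @ Lambda_inv @ phi))


def optimistic_target(reward, phi, Lambda_inv, next_value, gamma, beta, threshold):
    """One-step optimistic value target (illustrative assumption, not the paper's update).

    Well-explored features get an additive bonus on the reward; poorly explored
    ones are treated as if they led to an absorbing state with maximal return.
    """
    v_max = 1.0 / (1.0 - gamma)  # maximal return, assuming rewards in [0, 1]
    width = uncertainty_width(phi, Lambda_inv)
    if width > threshold:
        # Artificial transition to the absorbing maximal-return state.
        return v_max
    # Additive exploration bonus on the reward, clipped at the maximal return.
    return min(reward + beta * width + gamma * next_value, v_max)


# Covariance after three observations along the first feature direction.
Lambda = np.eye(2) + 3.0 * np.outer([1.0, 0.0], [1.0, 0.0])
Lambda_inv = np.linalg.inv(Lambda)
seen = np.array([1.0, 0.0])    # direction with data
unseen = np.array([0.0, 1.0])  # direction without data

target_seen = optimistic_target(0.5, seen, Lambda_inv, 2.0, 0.9, 0.5, 0.9)
target_unseen = optimistic_target(0.5, unseen, Lambda_inv, 2.0, 0.9, 0.5, 0.9)
```

In this toy setup the explored direction receives only a small bonus on top of its estimated value, while the unexplored direction is assigned the maximal return outright, which is the qualitative behavior the combination of the two techniques is meant to produce.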

Paper Structure

This paper contains 56 sections, 39 theorems, 228 equations, 2 figures, 1 table, 4 algorithms.

Key Result

Theorem 1

Suppose that Assumption ass:LinMDP holds, and that Algorithm alg:linear-rmax-ravi-ucb is executed with parameters specified in Appendix app:putting-together-main for a fixed number $K$ of episodes. Then, with probability at least $1 - \delta$,

Figures (2)

  • Figure 1: Illustration of the MDP $\mathcal{M}$ in black and its extension in blue. The MDP $\mathcal{M}^{\mathsf{+}}$ contains the additional red dashed edges that allow ascension to heaven.
  • Figure 2: The thick arrows represent the transitions of the process in the original MDP, while the dashed ones correspond to the utopian one.

Theorems & Definitions (65)

  • Theorem 1
  • Corollary 1
  • Lemma 1
  • Lemma 1
  • Lemma 1
  • Lemma 1
  • Lemma 1
  • Lemma 1
  • Theorem 2
  • Lemma 2
  • ...and 55 more