Efficient, Low-Regret, Online Reinforcement Learning for Linear MDPs

Philips George John, Arnab Bhattacharyya, Silviu Maniu, Dimitrios Myrisiotis, Zhenan Wu

TL;DR

The paper proposes two modifications of LSVI-UCB that alternate between periods of learning and not learning, reducing space and time usage while maintaining sublinear regret. Experiments show that these algorithms achieve low space usage and running time without significantly sacrificing regret.

Abstract

Reinforcement learning algorithms are usually stated without theoretical guarantees regarding their performance. Recently, Jin, Yang, Wang, and Jordan (COLT 2020) gave a polynomial-time reinforcement learning algorithm, LSVI-UCB, for the setting of linear Markov decision processes, and provided theoretical guarantees on its running time and regret. In real-world scenarios, however, the space usage of this algorithm can be prohibitive because of the linear regression step it uses. We propose and analyze two modifications of LSVI-UCB that alternate between periods of learning and not learning, reducing space and time usage while maintaining sublinear regret. We show experimentally, on synthetic data and real-world benchmarks, that our algorithms achieve low space usage and running time without significantly sacrificing regret.

Paper Structure

This paper contains 31 sections, 15 theorems, 42 equations, 9 figures, 3 tables, 3 algorithms.

Key Result

Theorem 1

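Suppose $A \in \mathbb{R}^{n \times n}$ is an invertible square matrix and $u, v \in \mathbb{R}^n$ are vectors. Then $A + u v^\top$ is invertible if and only if $1 + v^\top A^{-1} u \neq 0$, and in that case

$$(A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}.$$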
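The Sherman-Morrison identity named in Theorem 1 can be checked on a small example. The sketch below is illustrative only and is not taken from the paper: the function name `sherman_morrison_update` and the 2×2 inputs are hypothetical, and exact rational arithmetic is used so the check is free of rounding error. It computes $(A + u v^\top)^{-1}$ from $A^{-1}$ via the standard formula $A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}$ and rejects the singular case $1 + v^\top A^{-1} u = 0$.

```python
from fractions import Fraction


def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, using O(n^2) operations.

    Raises ValueError when 1 + v^T A^{-1} u == 0, which is exactly the
    case in which A + u v^T is not invertible.
    """
    n = len(A_inv)
    # Column vector A^{-1} u
    A_inv_u = [sum(A_inv[i][k] * u[k] for k in range(n)) for i in range(n)]
    # Row vector v^T A^{-1}
    vT_A_inv = [sum(v[k] * A_inv[k][j] for k in range(n)) for j in range(n)]
    # Scalar 1 + v^T A^{-1} u
    denom = 1 + sum(v[k] * A_inv_u[k] for k in range(n))
    if denom == 0:
        raise ValueError("A + u v^T is not invertible")
    return [
        [A_inv[i][j] - A_inv_u[i] * vT_A_inv[j] / denom for j in range(n)]
        for i in range(n)
    ]


# Hypothetical example: A = 2I, so A^{-1} = (1/2) I.
A_inv = [[Fraction(1, 2), Fraction(0)], [Fraction(0), Fraction(1, 2)]]
u = [Fraction(1), Fraction(2)]
v = [Fraction(3), Fraction(4)]

# Here A + u v^T = [[5, 4], [6, 10]], whose inverse is (1/26) * [[10, -4], [-6, 5]].
updated = sherman_morrison_update(A_inv, u, v)
print(updated)
```

The update touches each entry of the inverse once, so it costs $O(n^2)$ operations rather than the $O(n^3)$ of inverting $A + u v^\top$ from scratch.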

Figures (9)

  • Figure 1: Synthetic data: Regret curve.
  • Figure 2: Synthetic data: Space usage.
  • Figure 3: Synthetic data: Running time.
  • Figure 4: Synthetic data: Space usage of LSVI-UCB-Fixed as a function of parameters.
  • Figure 5: Synthetic data: Space usage of LSVI-UCB-Adaptive as a function of parameters.
  • ...and 4 more figures

Theorems & Definitions (29)

  • Theorem 1: Sherman-Morrison
  • Definition 2: Linear MDP
  • Proposition 3: jin2023provably
  • Remark 4
  • Remark 5: On the space usage of LSVI-UCB
  • Proposition 6
  • Proof (Sketch)
  • Proposition 7
  • Proposition 8
  • Proof (Sketch)
  • ...and 19 more