Online Reinforcement Learning with Passive Memory

Anay Pattanaik, Lav R. Varshney

TL;DR

Using passive memory improves performance, with theoretical regret guarantees that turn out to be near-minimax optimal; the quality of the passive memory determines the sub-optimality of the incurred regret.

Abstract

This paper considers an online reinforcement learning algorithm that leverages pre-collected data (passive memory) from the environment for online interaction. We show that using passive memory improves performance and further provide theoretical guarantees for regret that turns out to be near-minimax optimal. Results show that the quality of passive memory determines sub-optimality of the incurred regret. The proposed approach and results hold in both continuous and discrete state-action spaces.

Paper Structure

This paper contains 17 sections, 9 theorems, 40 equations.

Key Result

Theorem 1

Given a dataset $d^{\mathcal{D}}(s,a)$ for an MDP $\mathcal{M}$, the difference in the performance of the optimal policy and the policy produced by the regularized LP formulation of RL is given by … Here $d^*(s,a)$ is the state-action distribution induced by the optimal policy, $c=\frac{1}{\left\Vert\frac{\mu_0}{d^{\mathcal{D}}}\right\Vert_{-\infty}}$, and $\Vert\cdot\Vert_{-\infty}$ is the shorthand notati

Theorems & Definitions (10)

  • Theorem 1: Performance difference analysis
  • Lemma 1
  • Theorem 2: Minimax regret lower bound
  • Theorem 3: Regret upper bound with density estimation error
  • Lemma 2: Plug-in density estimator
  • Definition 1: State-action kernel density estimator
  • Lemma 3
  • Lemma 4
  • Theorem 4: Regret upper bound for continuous case
  • Corollary 4.1: Regret upper bound for discrete case
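
Definition 1 in the list above names a state-action kernel density estimator, but its exact form is not reproduced here. As a rough illustration of the general idea only, the following is a minimal sketch of a standard Gaussian kernel density estimate over concatenated (state, action) vectors. The function name `gaussian_kde_state_action`, the isotropic Gaussian kernel, and the single shared bandwidth are all assumptions made for this sketch and may differ from the estimator the paper actually analyzes.

```python
import numpy as np


def gaussian_kde_state_action(samples, query, bandwidth):
    """Estimate the density of (state, action) pairs at the query points.

    samples:   array of shape (n, d); each row is a concatenated (s, a) vector
               drawn from the pre-collected data (hypothetical input layout).
    query:     array of shape (m, d); points at which to evaluate the estimate.
    bandwidth: positive scalar h shared by all dimensions (an assumption).
    """
    samples = np.asarray(samples, dtype=float)
    query = np.asarray(query, dtype=float)
    n, d = samples.shape
    # Pairwise scaled differences between query points and samples: (m, n, d).
    diff = (query[:, None, :] - samples[None, :, :]) / bandwidth
    # Isotropic standard Gaussian kernel at each scaled difference: (m, n).
    kernel = np.exp(-0.5 * np.sum(diff ** 2, axis=-1)) / (2.0 * np.pi) ** (d / 2.0)
    # Average over samples and divide by h^d so the estimate integrates to one.
    return kernel.sum(axis=1) / (n * bandwidth ** d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy passive memory: 500 samples with a 2-D state and a 1-D action.
    memory = rng.normal(size=(500, 3))
    points = np.zeros((1, 3))
    print(gaussian_kde_state_action(memory, points, bandwidth=0.5))
```

The bandwidth controls the usual bias-variance trade-off of kernel estimates: smaller values track the data more closely but give noisier estimates, and the value used here is arbitrary.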