State-free Reinforcement Learning

Mingyu Chen; Aldo Pacchiano; Xuezhou Zhang

TL;DR

This work designs an algorithm that requires no information about the state space $\mathcal{S}$ while achieving a regret that is completely independent of $|\mathcal{S}|$ and depends only on $|\mathcal{S}^\Pi|$, as a first step toward designing RL algorithms that require no hyper-parameter tuning.

Abstract

In this work, we study the \textit{state-free RL} problem, in which the algorithm has no information about the state space before interacting with the environment. Specifically, for a policy class $\Pi$, denote the reachable state set by $\mathcal{S}^\Pi := \{ s \mid \max_{\pi \in \Pi} q^{P,\pi}(s) > 0 \}$, where $q^{P,\pi}(s)$ is the probability of visiting state $s$ when running policy $\pi$ under transition kernel $P$. We design an algorithm that requires no information about the state space $\mathcal{S}$ while achieving a regret that is completely independent of $|\mathcal{S}|$ and depends only on $|\mathcal{S}^\Pi|$. We view this as a concrete first step towards \textit{parameter-free RL}, with the goal of designing RL algorithms that require no hyper-parameter tuning.
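To make the definition of the reachable set concrete, the Python sketch below computes it for a small finite-horizon tabular MDP. It assumes a finite class of deterministic, step-dependent policies and a hypothetical dictionary encoding of the transition kernel (`P[h][(s, a)]` maps next states to probabilities). A state is counted as reachable if its visitation probability is positive at some step under at least one policy, which matches $\max_{\pi \in \Pi} q^{P,\pi}(s) > 0$ when $q^{P,\pi}(s)$ aggregates visitation over steps. The helper names `occupancy` and `reachable_states` are illustrative, not taken from the paper.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

State = Hashable
Action = Hashable
# Hypothetical encoding: P[h][(s, a)] maps each next state to its probability.
Transitions = List[Dict[Tuple[State, Action], Dict[State, float]]]
# Deterministic, step-dependent policy: (h, s) -> action (an assumption).
Policy = Callable[[int, State], Action]


def occupancy(P: Transitions, init: Dict[State, float],
              policy: Policy, H: int) -> List[Dict[State, float]]:
    """Forward-propagate visitation probabilities q_h(s) for steps h = 0..H-1."""
    q = [dict(init)]
    for h in range(H - 1):
        nxt: Dict[State, float] = {}
        for s, p in q[h].items():
            if p <= 0.0:
                continue
            a = policy(h, s)
            # Push probability mass along the transitions of the chosen action.
            for s2, pt in P[h].get((s, a), {}).items():
                nxt[s2] = nxt.get(s2, 0.0) + p * pt
        q.append(nxt)
    return q


def reachable_states(P: Transitions, init: Dict[State, float],
                     policies: List[Policy], H: int) -> Set[State]:
    """S^Pi: states visited with positive probability by at least one policy."""
    reach: Set[State] = set()
    for pi in policies:
        for q_h in occupancy(P, init, pi, H):
            reach.update(s for s, p in q_h.items() if p > 0.0)
    return reach


# Usage: "d" exists in the state space, but only action "X" leads to it.
P = [{("s0", "L"): {"a": 1.0},
      ("s0", "R"): {"b": 0.5, "c": 0.5},
      ("s0", "X"): {"d": 1.0}}]
policies = [lambda h, s: "L", lambda h, s: "R"]
print(sorted(reachable_states(P, {"s0": 1.0}, policies, H=2)))
# → ['a', 'b', 'c', 's0']
```

Because no policy in the class ever takes action `X`, state `d` falls outside the reachable set even though it belongs to the state space; a regret bound stated in terms of $|\mathcal{S}^\Pi|$ is unaffected by such states.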

Paper Structure (25 sections, 10 theorems, 71 equations, 1 figure, 1 algorithm)

Introduction
Related Works
Parameter-free algorithms:
Instance-dependent algorithms:
Problem Formulation
Technical challenges
Black-box reduction for State-free RL
Proof Highlight:
Improved regret bound for State-free RL
Conclusion
Omitted details for Section 2
Details for Proposition \ref{prop_1}
Details for Remark \ref{remark_1}
Details for Proposition \ref{prop_2}
Omitted proof of Section 3
...and 10 more sections

Key Result

Proposition 4.1

For stochastic MDPs, UCBVI (Azar et al., 2017) is a weakly state-free algorithm; that is, given only knowledge of $\mathcal{S}$, the regret guarantee of UCBVI adapts to $|\mathcal{S}^\Pi|$ and is independent of $|\mathcal{S}|$, except in logarithmic terms.

Figures (1)

Figure 1: An illustration of the mapping between the state space $\mathcal{S}$ and the pruned space $\mathcal{S}^\bot$. The left side represents the original state space $\mathcal{S}$, where grey nodes denote the states in $\mathcal{S}^\bot$ and red nodes denote the others. The right side is the corresponding pruned space $\mathcal{S}^\bot$, where blue nodes denote the auxiliary states $\{s_h^\bot\}_{h\in [H]}$. Given the structure, for any trajectory in space $\mathcal{S}$ (purple arrows), we can find a dual trajectory (yellow and green arrows) in the pruned space.
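The trajectory mapping described in the caption can be sketched in a few lines of Python. This is one simple reading of the figure, assuming that every state outside the retained set (the grey nodes) visited at step $h$ is replaced by the auxiliary state $s_h^\bot$; the paper's actual construction of the pruned space may differ. The tuple encoding `("bot", h)` and the names `aux_state` and `dual_trajectory` are hypothetical.

```python
from typing import Hashable, List, Set, Tuple

State = Hashable


def aux_state(h: int) -> Tuple[str, int]:
    """Hypothetical encoding of the auxiliary state s_h^bot for step h."""
    return ("bot", h)


def dual_trajectory(traj: List[State], kept: Set[State]) -> List[State]:
    """Map a trajectory in S to the pruned space (assumed per-step rule):
    states in `kept` stay as they are; any other state visited at step h
    is replaced by s_h^bot. Steps are numbered from 1 to match h in [H]."""
    return [s if s in kept else aux_state(h)
            for h, s in enumerate(traj, start=1)]


# Usage: "x" and "y" lie outside the retained set, so they become auxiliary states.
print(dual_trajectory(["s0", "x", "a", "y"], {"s0", "a"}))
# → ['s0', ('bot', 2), 'a', ('bot', 4)]
```

Keeping one auxiliary state per step, rather than a single shared one, preserves the layered (step-indexed) structure of the trajectory, which is consistent with the caption's $\{s_h^\bot\}_{h\in [H]}$.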

Theorems & Definitions (13)

Definition 3.1
Proposition 4.1
Remark 4.2
Theorem 5.2
Lemma 5.3
Lemma 5.4
Lemma 5.5
Remark 5.6
Lemma 6.1
Theorem 6.2
...and 3 more
