Finite-Sample Analysis of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning

Suei-Wen Chen; Keith Ross; Pierre Youssef


TL;DR

This paper develops a finite-sample bound for a modified MCES algorithm that solves the stochastic shortest path problem, and in doing so proves a novel result on the convergence rate of the policy iteration algorithm.

Abstract

Monte Carlo Exploring Starts (MCES), which aims to learn the optimal policy using only sample returns, is a simple and natural algorithm in reinforcement learning which has been shown to converge under various conditions. However, the convergence rate analysis for MCES-style algorithms in the form of sample complexity has received very little attention. In this paper we develop a finite-sample bound for a modified MCES algorithm which solves the stochastic shortest path problem. To this end, we prove a novel result on the convergence rate of the policy iteration algorithm. This result implies that with probability at least $1-\delta$, the algorithm returns an optimal policy after $\tilde{O}(SAK^3\log^3\frac{1}{\delta})$ sampled episodes, where $S$ and $A$ denote the number of states and actions respectively, $K$ is a proxy for episode length, and $\tilde{O}$ hides logarithmic factors and constants depending on the rewards of the environment that are assumed to be known.
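To make the setting concrete, the following is a minimal sketch of a batch-style MCES loop: it alternates Monte Carlo evaluation from every starting state-action pair with a greedy policy improvement step. It is not the paper's modified algorithm. The toy chain environment, the names `ADVANCE_PROB`, `step`, `sample_return`, and `batch_mces`, and the chosen numbers of improvement steps and episodes per pair are all assumptions made for illustration, and they are unrelated to the quantities appearing in the paper's bound.

```python
import random

# Hypothetical toy stochastic shortest path problem (not from the paper):
# non-terminal states 0, 1, 2 and a terminal state 3. Each action advances
# one state with its own probability, and every step yields reward -1.
N_STATES = 3
ACTIONS = (0, 1)
ADVANCE_PROB = {0: 0.5, 1: 0.9}


def step(state, action, rng):
    """Sample one transition; returns (next_state, reward)."""
    next_state = state + 1 if rng.random() < ADVANCE_PROB[action] else state
    return next_state, -1.0


def sample_return(state, action, policy, rng):
    """Undiscounted return of one episode started at (state, action),
    following `policy` after the first step (the exploring start)."""
    total = 0.0
    while state < N_STATES:
        state, reward = step(state, action, rng)
        total += reward
        if state < N_STATES:
            action = policy[state]
    return total


def batch_mces(num_improvements, episodes_per_pair, seed=0):
    """Alternate Monte Carlo evaluation and greedy improvement."""
    rng = random.Random(seed)
    policy = [0] * N_STATES  # start from the slower action everywhere
    for _ in range(num_improvements):
        q = {}
        for s in range(N_STATES):
            for a in ACTIONS:
                returns = [sample_return(s, a, policy, rng)
                           for _ in range(episodes_per_pair)]
                q[(s, a)] = sum(returns) / episodes_per_pair
        # Greedy improvement with respect to the estimated action values.
        policy = [max(ACTIONS, key=lambda a: q[(s, a)])
                  for s in range(N_STATES)]
    return policy


if __name__ == "__main__":
    print(batch_mces(num_improvements=3, episodes_per_pair=2000))
```

In this toy problem every policy reaches the terminal state with probability one, so each episode ends without an explicit length cap. A less benign environment would need one, which is roughly the role a proxy for episode length plays in a sample complexity statement.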


Paper Structure (16 sections, 6 theorems, 45 equations)


Introduction
Related Work
Preliminaries
Markov Decision Processes (MDPs)
Value Functions
Contraction Structure of Episodic MDPs
Comparison between different MDP settings
Suboptimality gaps
Finite Sample Analysis of MCES Variants
Main Results
Comparison with Existing Results and the Finite-horizon Setting
Conclusion
Proofs of Lemmas
Proof of Lemma \ref{lemma:uniform-bound-on-transition-matrix}
Proof of Lemma \ref{lemma:subexp-bound-on-absorption-time}
...and 1 more section

Key Result

Theorem 1

For any $\delta\in (0,1)$, with $L=L_\star$ improvement steps and $N=N(\delta)$ episodes per state-action pair between improvement steps, the resulting policy $\pi$ given by Algorithm \ref{alg:mces-modified} satisfies

Theorems & Definitions (12)

Theorem 1
Corollary 1
Lemma 1
Lemma 2
Lemma 3
Theorem 2
proof
proof: Proof of Theorem \ref{theorem:main}
proof: Proof of Corollary \ref{corollary:main}
proof
...and 2 more
