Finite-Sample Analysis of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning

Suei-Wen Chen, Keith Ross, Pierre Youssef

TL;DR

A finite-sample bound is developed for a modified MCES algorithm that solves the stochastic shortest path problem. The analysis rests on a novel result on the convergence rate of the policy iteration algorithm.

Abstract

Monte Carlo Exploring Starts (MCES), which aims to learn the optimal policy using only sample returns, is a simple and natural algorithm in reinforcement learning which has been shown to converge under various conditions. However, the convergence rate analysis for MCES-style algorithms in the form of sample complexity has received very little attention. In this paper we develop a finite sample bound for a modified MCES algorithm which solves the stochastic shortest path problem. To this end, we prove a novel result on the convergence rate of the policy iteration algorithm. This result implies that with probability at least $1-\delta$, the algorithm returns an optimal policy after $\tilde{O}(SAK^3\log^3\frac{1}{\delta})$ sampled episodes, where $S$ and $A$ denote the number of states and actions respectively, $K$ is a proxy for episode length, and $\tilde{O}$ hides logarithmic factors and constants depending on the rewards of the environment that are assumed to be known.
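
To make the idea of learning from sample returns concrete, the following is a minimal Python sketch of a batched exploring-starts scheme: every state-action pair is used as an episode start, the sampled returns are averaged, and the policy is then made greedy. It is an illustration under stated assumptions, not the authors' algorithm. The function names, the toy chain environment `toy_step`, its rewards and slip probability, and the parameter names `num_improvements` and `episodes_per_pair` are all hypothetical. For simplicity the sketch uses only the return of the starting pair of each episode and assumes the initial policy reaches the terminal state.

```python
import random
from typing import Callable, Dict, List, Tuple

StepFn = Callable[[int, int, random.Random], Tuple[int, float, bool]]


def sample_return(step: StepFn, policy: Dict[int, int], state: int,
                  action: int, rng: random.Random, max_steps: int = 1000) -> float:
    """Undiscounted return of one episode that starts with (state, action)
    and follows `policy` afterwards. `max_steps` is a safety cap (assumption)."""
    total = 0.0
    for _ in range(max_steps):
        state, reward, done = step(state, action, rng)
        total += reward
        if done:
            break
        action = policy[state]
    return total


def mces_batched(step: StepFn, states: List[int], actions: List[int],
                 num_improvements: int, episodes_per_pair: int,
                 seed: int = 0) -> Dict[int, int]:
    """Alternate Monte Carlo evaluation from every start pair with greedy improvement."""
    rng = random.Random(seed)
    policy = {s: actions[0] for s in states}  # assumed to reach the terminal state
    for _ in range(num_improvements):
        q: Dict[Tuple[int, int], float] = {}
        for s in states:
            for a in actions:
                returns = [sample_return(step, policy, s, a, rng)
                           for _ in range(episodes_per_pair)]
                q[(s, a)] = sum(returns) / len(returns)  # average sample return
        # Greedy improvement with respect to the estimated action values.
        policy = {s: max(actions, key=lambda a: q[(s, a)]) for s in states}
    return policy


def toy_step(state: int, action: int, rng: random.Random) -> Tuple[int, float, bool]:
    """Hypothetical chain with states 0, 1, 2 and terminal state 3.
    Action 0 walks right (reward -1, slips and stays put with probability 0.1);
    action 1 is a shortcut straight to the terminal state (reward -2.5)."""
    terminal = 3
    if action == 1:
        return terminal, -2.5, True
    next_state = state + 1 if rng.random() < 0.9 else state
    return next_state, -1.0, next_state == terminal


if __name__ == "__main__":
    learned = mces_batched(toy_step, states=[0, 1, 2], actions=[0, 1],
                           num_improvements=3, episodes_per_pair=200)
    print(learned)
```

In this toy chain the expected cost of walking from state 0 (about 3.33) exceeds the shortcut cost of 2.5, while walking is cheaper from states 1 and 2, so the greedy policy should take the shortcut only in state 0. The total number of sampled episodes in the sketch is `num_improvements * len(states) * len(actions) * episodes_per_pair`.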

Paper Structure

This paper contains 16 sections, 6 theorems, 45 equations.

Key Result

Theorem 1

For any $\delta\in (0,1)$, with $L=L_\star$ improvement steps and $N=N(\delta)$ episodes per state-action pair between improvement steps, the resulting policy $\pi$ given by the modified MCES algorithm satisfies

Theorems & Definitions (12)

  • Theorem 1
  • Corollary 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 2
  • proof
  • proof : Proof of Theorem 1
  • proof : Proof of Corollary 1
  • proof
  • ...and 2 more