On the Convergence of Monte Carlo UCB for Random-Length Episodic MDPs

Zixuan Dong, Che Wang, Keith Ross

TL;DR

This paper studies the convergence of MC-UCB for random-length episodic MDPs. It proves almost-sure convergence of the Q function for OPFF MDPs, which implies almost-sure convergence for all finite-horizon MDPs as a corollary. The main technique is a backward-induction proof on a DAG induced by optimal transitions, aided by first-visit sampling and a termination mechanism that avoids loops. Experiments on Blackjack and Cliff-Walking show policy and value convergence that is consistent with the theoretical guarantees. The work provides a bridge between MC-based exploration and convergence theory in realistic episodic tasks.

Abstract

In reinforcement learning, Monte Carlo algorithms update the Q function by averaging the episodic returns. In the Monte Carlo UCB (MC-UCB) algorithm, the action taken in each state is the action that maximizes the Q function plus an Upper Confidence Bounds (UCB) exploration term, which biases the choice of actions to those that have been chosen less frequently. Although there has been significant work on establishing regret bounds for MC-UCB, most of that work has been focused on finite-horizon versions of the problem, for which each episode terminates after a constant number of steps. For such finite-horizon problems, the optimal policy depends both on the current state and the time within the episode. However, for many natural episodic problems, such as games like Go and Chess and robotic tasks, the episode is of random length and the optimal policy is stationary. For such environments, it is an open question whether the Q-function in MC-UCB will converge to the optimal Q function; we conjecture that, unlike Q-learning, it does not converge for all MDPs. We nevertheless show that for a large class of MDPs, which includes stochastic MDPs such as blackjack and deterministic MDPs such as Go, the Q function in MC-UCB converges almost surely to the optimal Q function. An immediate corollary of this result is that it also converges almost surely for all finite-horizon MDPs. We also provide numerical experiments, providing further insights into MC-UCB.
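
The action-selection and averaging steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the bonus form `c * sqrt(ln N(s) / N(s, a))`, the constant `c`, the `max_steps` cap, the undiscounted return, and the two-state `toy_step` environment are all hypothetical choices made for the example.

```python
import math
import random
from collections import defaultdict


def mc_ucb(step, start_state, actions, episodes=5000, c=1.0, max_steps=100, seed=0):
    """First-visit Monte Carlo control with a UCB bonus (undiscounted returns)."""
    rng = random.Random(seed)
    q = defaultdict(float)    # running average of returns for each (state, action)
    n_sa = defaultdict(int)   # number of returns averaged into q[(state, action)]
    n_s = defaultdict(int)    # sum of n_sa over actions, per state

    def choose(state):
        # Try each action once, then maximize Q plus the exploration bonus.
        untried = [a for a in actions if n_sa[(state, a)] == 0]
        if untried:
            return rng.choice(untried)
        log_n = math.log(n_s[state])
        return max(actions,
                   key=lambda a: q[(state, a)] + c * math.sqrt(log_n / n_sa[(state, a)]))

    for _ in range(episodes):
        state, trajectory = start_state, []
        for _ in range(max_steps):  # assumed cap, standing in for a termination mechanism
            action = choose(state)
            next_state, reward, done = step(state, action, rng)
            trajectory.append((state, action, reward))
            state = next_state
            if done:
                break

        # Compute the return that followed each step, working backwards.
        g, returns = 0.0, []
        for s, a, r in reversed(trajectory):
            g += r
            returns.append((s, a, g))

        # First-visit update: only the earliest occurrence of (s, a) is averaged.
        seen = set()
        for s, a, ret in reversed(returns):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            n_s[s] += 1
            n_sa[(s, a)] += 1
            q[(s, a)] += (ret - q[(s, a)]) / n_sa[(s, a)]
    return q


def toy_step(state, action, rng):
    """Hypothetical MDP with episodes of length 1 or 2.

    State 0: action 0 stops with reward 0.3; action 1 moves to state 1.
    State 1: action 0 stops with reward 0.2; action 1 stops with reward 1 w.p. 0.5.
    """
    if state == 0:
        return (None, 0.3, True) if action == 0 else (1, 0.0, False)
    if action == 0:
        return None, 0.2, True
    return None, (1.0 if rng.random() < 0.5 else 0.0), True
```

Calling `mc_ucb(toy_step, start_state=0, actions=[0, 1])` returns the table of averaged returns. In this toy problem the optimal values are 0.5 for action 1 in both states, so the estimates for those pairs should settle near 0.5 while the less-visited alternatives stay at 0.3 and 0.2.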

Paper Structure

This paper contains 11 sections, 2 theorems, 25 equations, 2 figures, 2 algorithms.

Key Result

Theorem 1

Suppose the MDP is OPFF. Then $V_n(s)$ converges to $V^*(s)$ w.p.1 for all $s \in \mathcal{S}$, and $Q_n(s,a)$ converges to $Q^*(s,a)$ w.p.1 for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$.

Figures (2)

  • Figure 1: Experiments on Blackjack. The x-axis is the total number of training episodes; the y-axes show performance, policy convergence, and Q-value convergence, respectively.
  • Figure 2: Experiments on Cliff-Walking. The x-axis is the total number of training episodes; the y-axes show performance, V-value convergence, and Q-value convergence, respectively.

Theorems & Definitions (4)

  • Theorem 1: Almost sure convergence of MC-UCB for OPFF MDPs
  • proof
  • Lemma 1: Almost Sure Convergence for the MAB with UCB1
  • proof
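
For readers unfamiliar with the bandit setting that Lemma 1 refers to, the standard UCB1 rule can be sketched as below. This is only an illustration of the selection rule, not the paper's statement or proof; the Bernoulli arms, the function name `ucb1`, and the round count are assumptions made for the example.

```python
import math
import random


def ucb1(arm_means, rounds=5000, seed=0):
    """Run UCB1 on Bernoulli arms; return (empirical means, pull counts)."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k     # number of pulls per arm
    means = [0.0] * k    # empirical mean reward per arm
    for t in range(1, rounds + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            # Pick the arm with the largest upper confidence bound.
            arm = max(range(k),
                      key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    return means, counts
```

With hypothetical arm means such as `ucb1([0.2, 0.5, 0.8])`, most pulls should go to the last arm, and its empirical mean should end up close to 0.8, while the other arms are sampled only occasionally.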