Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes

Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, Peinan Zhang

TL;DR

This paper tackles online reinforcement learning in history-based decision processes (HDPs), where the environment is non-Markovian and simulations may be unavailable. It introduces Monte Carlo Tree Learning (MCTL), an online, simulator-free variant of MCTS, and couples it with policy gradient (PG) methods to form PG-MCTL, a two-timescale stochastic-approximation framework with convergence guarantees. The authors derive conditions under which the PG and MCTL components converge to a stationary, meaningful solution and provide a practical implementation that satisfies these conditions. Empirically, PG-MCTL demonstrates strong performance on HDP tasks, outperforming baselines and a naive mixture, and adaptive mixing further improves performance in tasks with long-horizon dependencies. The work advances online, model-free RL for HDPs by combining the strengths of PG’s function approximation with MCTL’s exploration, with potential impact on tasks like text generation and causal discovery where history matters.

Abstract

Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for an expected return using gradient ascent. While PG can work well even in non-Markovian environments, it may encounter plateaus or peakiness issues. As another successful RL approach, algorithms based on Monte Carlo Tree Search (MCTS), which include AlphaZero, have obtained groundbreaking results, especially in the game-playing domain. They are also effective when applied to non-Markov decision processes. However, the standard MCTS is a method for decision-time planning, which differs from the online RL setting. In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of MCTS for online RL setups. We then explore a combined policy approach of PG and MCTL to leverage their strengths. We derive conditions for asymptotic convergence with the results of a two-timescale stochastic approximation and propose an algorithm that satisfies these conditions and converges to a reasonable solution. Our numerical experiments validate the effectiveness of the proposed methods.

Paper Structure (25 sections, 7 theorems, 72 equations, 3 figures, 2 tables, 1 algorithm)


Key Result

Proposition 1

Assume that Assumptions assum:noise through assum:xybound hold. Let the mixing probability function $\lambda:\mathcal{H}\rightarrow[0,1]$ be invariant to the number of episodes $n$, and let the learning rates $\alpha_n$ and $\eta_n$ satisfy [...]. Then, almost surely, the sequence $\{(\theta_{n},\omega_{n})\}$ generated by Eqs. eq:x_update_n and eq:y_update_n converges to a compact connected internally chain transi...
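
The two ingredients named in Proposition 1, a history-dependent mixing probability $\lambda$ and a pair of learning rates $\alpha_n$ and $\eta_n$, can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the function names, the convention that $\lambda(h)$ is the probability of following the MCTL policy rather than the PG policy, the polynomial step-size schedules, and the choice of which rate decays faster. The paper's actual learning-rate conditions are not reproduced in this summary, so the schedules only show the generic two-timescale pattern in which both rates satisfy the usual summability conditions and their ratio tends to zero.

```python
import random


def step_sizes(n, a=0.6, b=0.9):
    """Hypothetical schedules alpha_n = n^-a and eta_n = n^-b, 0.5 < a < b <= 1.

    Both sums diverge while the sums of squares converge, and
    eta_n / alpha_n = n^-(b - a) tends to 0 (two separate timescales).
    Which rate belongs to which component is an assumption here.
    """
    return n ** -a, n ** -b


def sample_mixed_action(history, lam, pg_policy, mctl_policy, rng):
    """Draw one action from a mixture of two policies.

    lam(history) is assumed to be the probability of using mctl_policy;
    each policy maps a history to a dict {action: probability}.
    """
    source = mctl_policy if rng.random() < lam(history) else pg_policy
    probs = source(history)
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    pg = lambda h: {"left": 0.5, "right": 0.5}      # stand-in for the PG policy
    mctl = lambda h: {"left": 0.9, "right": 0.1}    # stand-in for the MCTL policy
    lam = lambda h: 0.3                             # constant in the episode count n
    print(sample_mixed_action(("obs0",), lam, pg, mctl, rng))
    print(step_sizes(100))
```

Because `lam` takes only the history as input, it does not change with the episode index, which mirrors the invariance requirement on $\lambda$ stated in the proposition.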

Figures (3)

  • Figure 1: Overview of the proposed approach: PG guided by Monte Carlo Tree Learning (PG-MCTL). Unlike MCTS, which requires a simulator to generate possible future states and rewards, MCTL builds a tree based on real trajectories experienced by an agent while still inheriting core MCTS properties. PG and MCTL have fundamentally different properties. PG-MCTL takes advantage of them.
  • Figure 2: Performance comparison over ten independent runs; error bars show the standard error of the mean. (a) The randomly synthesized task (${T}=15$). (b) The T-maze task: the left plot shows an easy setting (corridor length $L = 30$, initial position $s_0=0$); the right plot shows a more difficult setting with more sub-optimal policies (corridor length $L = 100$, initial position $s_0=50$).
  • Figure 3: Long-term dependency T-maze task: an agent starts at the position $\textsf{S}$. Only at the initial time step $t=0$, it can observe a signal 'up' or 'down' that indicates it should go north or south at the T-junction in this episode.
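
The T-maze described in Figures 2 and 3 can be sketched as a small environment to make the long-term dependency concrete. This is a hypothetical reconstruction from the captions only: the class name, the action names, the observation strings, and the reward values (1 for the correct turn, 0 otherwise) are assumptions, not the paper's specification. The one property taken from the captions is that the 'up' or 'down' signal is observable only at $t=0$, so the agent must carry it through the whole corridor.

```python
import random


class TMaze:
    """Hypothetical T-maze: corridor cells 0..length, T-junction at cell `length`."""

    ACTIONS = ("east", "west", "north", "south")

    def __init__(self, length=30, start=0, seed=None):
        self.length = length      # corridor length L
        self.start = start        # initial position s_0
        self.rng = random.Random(seed)

    def reset(self):
        self.pos = self.start
        self.signal = self.rng.choice(("up", "down"))
        return self.signal        # the signal is visible only at t = 0

    def step(self, action):
        """Return (observation, reward, done); rewards are placeholder values."""
        at_junction = self.pos == self.length
        if at_junction and action in ("north", "south"):
            correct = "north" if self.signal == "up" else "south"
            return "end", (1.0 if action == correct else 0.0), True
        if action == "east":
            self.pos = min(self.pos + 1, self.length)
        elif action == "west":
            self.pos = max(self.pos - 1, 0)
        # north/south inside the corridor leave the agent in place
        obs = "junction" if self.pos == self.length else "corridor"
        return obs, 0.0, False


if __name__ == "__main__":
    env = TMaze(length=30, start=0, seed=0)
    signal = env.reset()
    done = False
    while not done:
        at_junction = env.pos == env.length
        action = ("north" if signal == "up" else "south") if at_junction else "east"
        obs, reward, done = env.step(action)
    print(signal, reward)
```

Since later observations never repeat the signal, a policy conditioned only on the current observation cannot pick the right turn reliably; it has to condition on the history, which is the setting the paper targets.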

Theorems & Definitions (7)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Theorem 1
  • Lemma 1
  • Lemma 2
  • Lemma 3