Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes
Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, Peinan Zhang
TL;DR
This paper tackles online reinforcement learning in history-based decision processes (HDPs), where the environment is non-Markovian and simulations may be unavailable. It introduces Monte Carlo Tree Learning (MCTL), an online, simulator-free variant of MCTS, and couples it with policy gradient (PG) methods to form PG-MCTL, a two-timescale stochastic-approximation framework with convergence guarantees. The authors derive conditions under which the PG and MCTL components converge to a stationary, meaningful solution and provide a practical implementation that satisfies these conditions. Empirically, PG-MCTL demonstrates strong performance on HDP tasks, outperforming baselines and a naive mixture, and adaptive mixing further improves performance in tasks with long-horizon dependencies. The work advances online, model-free RL for HDPs by combining the strengths of PG’s function approximation with MCTL’s exploration, with potential impact on tasks like text generation and causal discovery where history matters.
Abstract
Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for an expected return using gradient ascent. While PG can work well even in non-Markovian environments, it may encounter plateaus or peakiness issues. As another successful RL approach, algorithms based on Monte Carlo Tree Search (MCTS), which include AlphaZero, have obtained groundbreaking results, especially in the game-playing domain. They are also effective when applied to non-Markov decision processes. However, standard MCTS is a method for decision-time planning, which differs from the online RL setting. In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of MCTS for online RL setups. We then explore a policy that combines PG and MCTL to leverage the strengths of both. We derive conditions for asymptotic convergence using results from two-timescale stochastic approximation, and propose an algorithm that satisfies these conditions and converges to a reasonable solution. Our numerical experiments validate the effectiveness of the proposed methods.
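To make the PG setting described above concrete, the following is a minimal sketch of a REINFORCE-style update with a policy indexed by the full history, run on a toy non-Markov task. It is not the paper's PG-MCTL algorithm and contains no tree component. The task, the tabular softmax parameterization, the running-average baseline, and all names (`run_episode`, `train`, `evaluate`, the step sizes) are assumptions made for illustration only.

```python
import math
import random
from collections import defaultdict


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def run_episode(theta, rng):
    """Hypothetical toy task: the reward depends on the first observation,
    which is no longer visible in the latest observation at the final step."""
    b = rng.randrange(2)          # initial informative observation
    history = (b,)
    trajectory = []
    for _ in range(2):
        probs = softmax(theta[history])
        a = 0 if rng.random() < probs[0] else 1
        trajectory.append((history, a, probs))
        # Append the action and an uninformative observation (always 0).
        history = history + (a, 0)
    # Reward 1 only if the final action matches the first observation.
    reward = 1.0 if trajectory[-1][1] == b else 0.0
    return trajectory, reward


def train(num_episodes=5000, lr=0.5, seed=0):
    """Gradient ascent on expected return with a history-indexed softmax policy."""
    rng = random.Random(seed)
    theta = defaultdict(lambda: [0.0, 0.0])  # one logit pair per history
    baseline = 0.0                           # running average of returns
    for _ in range(num_episodes):
        trajectory, ret = run_episode(theta, rng)
        advantage = ret - baseline
        for history, a, probs in trajectory:
            for k in range(2):
                # d/d theta_k of log softmax(theta)[a] = 1[k == a] - probs[k]
                grad_log = (1.0 if k == a else 0.0) - probs[k]
                theta[history][k] += lr * advantage * grad_log
        baseline += 0.05 * (ret - baseline)
    return theta


def evaluate(theta, episodes=1000, seed=1):
    """Average return of the learned policy over fresh episodes."""
    rng = random.Random(seed)
    return sum(run_episode(theta, rng)[1] for _ in range(episodes)) / episodes
```

Calling `evaluate(train())` returns the average reward of the learned policy. Because the policy is keyed by the whole history tuple rather than by the latest observation, it can recover the first observation at the final step, which a policy conditioned only on the latest observation could not do in this toy task.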
