Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning

Pedro P. Santos; Alberto Sardinha; Francisco S. Melo

TL;DR

This work introduces the first approach to solving infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, where a policy's performance is judged from a single trajectory. It establishes that optimality may require non-Markovian (history-dependent) policies, and it recasts the problem as an equivalent occupancy MDP, in which policies that are stationary with respect to the running occupancy can be found with standard MDP planning. An online Monte-Carlo tree search method is proposed for the occupancy MDP, with convergence guarantees and a regret bound that decays with the horizon $H$. The authors show that exact single-trial policy optimization is NP-hard, yet report that online planning with truncation outperforms baselines on entropy, imitation, and adversarial tasks in both illustrative and OpenAI Gym environments. This work narrows the gap between single-trajectory objective formulations and scalable planning, with implications for real-world RL, where performance is often evaluated on a single trial.

Abstract

In this work, we contribute the first approach for solving infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated based on a single trajectory. First, we provide fundamental results regarding policy optimization in the single-trial regime: we investigate which class of policies suffices for optimality, cast our problem as a particular MDP that is equivalent to the original problem, and study the computational hardness of policy optimization. Second, we show how online planning techniques, in particular a Monte-Carlo tree search algorithm, can be leveraged to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.
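
To make the single-trial objective concrete, the following minimal Python sketch computes the running discounted occupancy of one trajectory and evaluates a utility on it. It is an illustration under stated assumptions, not the authors' implementation: the normalization `(1 - gamma) * gamma**t`, the function names `update_occupancy`, `running_occupancy`, and `negative_entropy`, and the choice of negative entropy as the utility to be minimized are all hypothetical choices made for this example.

```python
import math


def update_occupancy(occ, t, state, action, gamma):
    """Return a copy of `occ` with the visit at step t added.

    Assumed weighting (not taken from the paper): the visit at step t
    contributes (1 - gamma) * gamma**t to the pair (state, action).
    """
    new_occ = dict(occ)
    key = (state, action)
    new_occ[key] = new_occ.get(key, 0.0) + (1.0 - gamma) * gamma ** t
    return new_occ


def running_occupancy(trajectory, gamma):
    """Discounted empirical occupancy of a single finite trajectory.

    `trajectory` is a list of (state, action) pairs. For a trajectory of
    length H the weights sum to 1 - gamma**H, i.e. truncation leaves a
    tail mass of gamma**H unassigned.
    """
    occ = {}
    for t, (state, action) in enumerate(trajectory):
        occ = update_occupancy(occ, t, state, action, gamma)
    return occ


def negative_entropy(occ):
    """Hypothetical utility: negative entropy of the renormalized occupancy.

    Lower values correspond to a trajectory that spreads its discounted
    visits more evenly over state-action pairs.
    """
    total = sum(occ.values())
    return sum((w / total) * math.log(w / total) for w in occ.values() if w > 0)


if __name__ == "__main__":
    gamma = 0.9
    stay = [("s0", "a0")] * 4
    explore = [("s0", "a0"), ("s1", "a0"), ("s2", "a1"), ("s3", "a1")]
    for name, traj in [("stay", stay), ("explore", explore)]:
        occ = running_occupancy(traj, gamma)
        print(name, negative_entropy(occ))
```

In this sketch the utility is evaluated on the occupancy of the one trajectory actually experienced, rather than on an expectation over many trajectories. The incremental `update_occupancy` step also hints at why carrying the running occupancy alongside the environment state can make the problem Markovian again: the next occupancy depends only on the current occupancy, the time step, and the current state-action pair.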
