Solving General-Utility Markov Decision Processes in the Single-Trial Regime with Online Planning

Pedro P. Santos, Alberto Sardinha, Francisco S. Melo

TL;DR

This work introduces the first approach to solving infinite-horizon discounted GUMDPs in the single-trial regime, where policy performance is judged from a single trajectory. It establishes that optimality may require non-Markovian (history-dependent) policies, and it recasts the problem as an occupancy MDP, enabling standard MDP planning via stationary policies over running occupancy. An online Monte-Carlo Tree Search method is proposed for the occupancy MDP, with convergence guarantees and a bound showing the regret decays with the horizon $H$. The authors demonstrate NP-hardness of exact single-trial policy optimization, yet show that online planning with truncation yields practical, superior performance over baselines across entropy, imitation, and adversarial tasks in both illustrative and OpenAI Gym environments. This work narrows the gap between single-trajectory objective formulations and scalable planning, with implications for real-world RL where evaluations are often on a single trial.

Abstract

In this work, we contribute the first approach to solve infinite-horizon discounted general-utility Markov decision processes (GUMDPs) in the single-trial regime, i.e., when the agent's performance is evaluated based on a single trajectory. First, we provide some fundamental results regarding policy optimization in the single-trial regime, investigating which class of policies suffices for optimality, casting our problem as a particular MDP that is equivalent to our original problem, as well as studying the computational hardness of policy optimization in the single-trial regime. Second, we show how we can leverage online planning techniques, in particular a Monte-Carlo tree search algorithm, to solve GUMDPs in the single-trial regime. Third, we provide experimental results showcasing the superior performance of our approach in comparison to relevant baselines.
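To make the single-trial setting concrete, the sketch below computes the empirical discounted state occupancy of one finite trajectory and evaluates a convex objective on it. It is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the normalization by the sum of discount weights over the truncated trajectory, and the choice of negative entropy as the objective (consistent with the "lower is better" convention in the entropy experiments) are all assumptions made here. The last helper shows how a running, unnormalized occupancy could be updated one step at a time, which is the kind of quantity a planner over running occupancy would track.

```python
import math


def discounted_occupancy(states, num_states, gamma):
    """Empirical discounted state occupancy of a single finite trajectory.

    Assumption: weights gamma**t are normalized by their sum over the
    truncated trajectory, so the result is a probability vector.
    """
    weights = [gamma ** t for t in range(len(states))]
    total = sum(weights)
    occupancy = [0.0] * num_states
    for state, weight in zip(states, weights):
        occupancy[state] += weight / total
    return occupancy


def negative_entropy(occupancy):
    """Convex objective f(d) = sum_s d(s) log d(s); lower means more spread out."""
    return sum(p * math.log(p) for p in occupancy if p > 0.0)


def step_running_occupancy(running, state, t, gamma):
    """Hypothetical incremental update: add gamma**t mass to the visited state."""
    updated = list(running)
    updated[state] += gamma ** t
    return updated


if __name__ == "__main__":
    trajectory = [0, 1, 2, 0, 1]  # state indices visited in one trial
    occ = discounted_occupancy(trajectory, num_states=3, gamma=0.9)
    print("occupancy:", occ)
    print("single-trial objective:", negative_entropy(occ))
```

The objective is evaluated on the occupancy of the one trajectory actually observed, rather than on the expected occupancy over many trajectories; because the objective is nonlinear, the two can differ, which is what distinguishes the single-trial regime.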

Paper Structure

This paper contains 45 sections, 9 theorems, 54 equations, 13 figures, 2 tables.

Key Result

Theorem 1

There exists a GUMDP $\mathcal{M}_f$ with $\gamma \in (0,1)$ and $L$-Lipschitz convex objective such that:

Figures (13)

  • Figure 1: Illustrative GUMDPs. $\mathcal{M}_{f,1}$ and $\mathcal{M}_{f,3}$ share the same dynamics but differ in the objective function. In all GUMDPs, the chosen action succeeds with $90\%$ probability and, with $10\%$ probability, the agent randomly moves to any of the states. The behavior policy for $\mathcal{M}_{f,2}$ is $\beta(a_0|s_0) = 0.8$ and $\beta(a_0|s_1) = 0.2$. In (c), we plot the three cost functions, $c_1, c_2$ and $c_3$, of the adversarial MDP.
  • Figure 2: Illustration of the GUMDP used in the proof of the theorem on classes of policies, with $\mathcal{S} = \{s^0, s^1, s^2\}$ and $\mathcal{A} = \{a^1,a^2\}$. The distribution of initial states is $p_0(s^0)=0, p_0(s^1) = \epsilon, p_0(s^2) = 1 - \epsilon$, where we set $\epsilon = 1/2$. All transitions are deterministic and, in states $s^1$ and $s^2$, any of the actions takes the agent back to state $s^0$.
  • Figure 3: Illustration of objectives $F_{1}$ and $F_{1,H}$, as well as the relation between different quantities of interest for the proof.
  • Figure 4: GUMDP instance used in the NP-Hardness proof.
  • Figure 5: Maximum state entropy exploration, $\mathcal{M}_{f,1}$: (a) Mean single-trial objective $F_{1,H}(\pi)$ obtained by different policies. Error bars correspond to the $90\%$ confidence interval of the mean. (b) Distribution of the single-trial objective $F_{1,H}(\pi)$ obtained by different policies. (c) Mean single-trial objective $F_{1,H}(\pi)$ obtained by the MCTS-based algorithm as a function of the number of expansion steps. Shaded areas correspond to the $90\%$ confidence interval of the mean. Across all plots, lower is better.
  • ...and 8 more figures

Theorems & Definitions (20)

  • Theorem 1
  • Proposition 1: Regret decomposition
  • Proposition 2: One-to-one mapping between histories in $\mathcal{M}_f$ and states in $\mathcal{M}_\text{O}$
  • Theorem 2: Solving $\mathcal{M}_f$ is "equivalent" to solving $\mathcal{M}_\text{O}$
  • Remark 1: Deterministic policies suffice for optimality
  • Theorem 3: NP-Hardness of policy optimization in the single-trial regime
  • Remark 2
  • proof
  • Lemma 1
  • proof
  • ...and 10 more