Threshold UCT: Cost-Constrained Monte Carlo Tree Search with Pareto Curves

Martin Kurečka, Václav Nevyhoštěný, Petr Novotný, Vít Unčovský

TL;DR

This work advances safe planning under uncertainty by introducing Threshold UCT (T-UCT), an MCTS-based planner for constrained Markov decision processes (CMDPs). T-UCT estimates Pareto curves of cost–payoff trade-offs online and uses them in both search and action selection, giving a balanced, data-efficient way to obtain policies that are safe yet valuable. The theoretical results guarantee ε-soundness given sufficiently many search iterations, and empirical evaluations on the Gridworld and Manhattan benchmarks show T-UCT outperforming state-of-the-art baselines in both safety and payoff, especially in larger, more complex environments. The approach provides a scalable, principled framework for cost-constrained planning, with potential extensions to learning Pareto-set representations.

Abstract

Constrained Markov decision processes (CMDPs), in which the agent optimizes expected payoffs while keeping the expected cost below a given threshold, are the leading framework for safe sequential decision making under stochastic uncertainty. Among algorithms for planning and learning in CMDPs, methods based on Monte Carlo tree search (MCTS) have particular importance due to their efficiency and extensibility to more complex frameworks (such as partially observable settings and games). However, current MCTS-based methods for CMDPs either struggle to find safe (i.e., constraint-satisfying) policies, or are too conservative and do not find valuable policies. We introduce Threshold UCT (T-UCT), an online MCTS-based algorithm for CMDP planning. Unlike previous MCTS-based CMDP planners, T-UCT explicitly estimates Pareto curves of cost-utility trade-offs throughout the search tree, using these together with novel action selection and threshold update rules to seek safe and valuable policies. Our experiments demonstrate that our approach significantly outperforms state-of-the-art methods from the literature.
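
To make the notion of a cost-utility Pareto curve concrete, the following is a minimal sketch, not the T-UCT algorithm itself. It assumes each candidate policy is summarized by a pair (expected cost, expected payoff); the names `Point`, `pareto_front`, and `best_under_threshold` are hypothetical and do not come from the paper. The sketch filters out dominated pairs and then picks the highest-payoff pair whose cost stays within a threshold.

```python
from typing import List, Optional, Tuple

# Hypothetical summary of a policy: (expected cost, expected payoff).
Point = Tuple[float, float]


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, ordered by increasing cost.

    A point is dropped if some other point has cost no higher and
    payoff no lower than it.
    """
    front: List[Point] = []
    best_payoff = float("-inf")
    # Sort by cost ascending; for equal cost, the higher payoff comes first.
    for cost, payoff in sorted(points, key=lambda p: (p[0], -p[1])):
        # Keep a point only if it strictly improves on every cheaper point.
        if payoff > best_payoff:
            front.append((cost, payoff))
            best_payoff = payoff
    return front


def best_under_threshold(front: List[Point], threshold: float) -> Optional[Point]:
    """Pick the highest-payoff point with expected cost <= threshold, if any."""
    feasible = [p for p in front if p[0] <= threshold]
    return max(feasible, key=lambda p: p[1]) if feasible else None


# Illustrative numbers only; they are not results from the paper.
candidates = [(0.0, 1.0), (0.2, 3.0), (0.3, 2.5), (0.5, 4.0), (0.5, 3.5)]
front = pareto_front(candidates)
choice = best_under_threshold(front, threshold=0.3)
```

This sketch considers only the listed points. In a CMDP, randomizing between two policies can also achieve intermediate cost-payoff pairs, which this simplified version does not model.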

Paper Structure

This paper contains 41 sections, 5 theorems, 34 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Let $\mathcal{C}$ be a CMDP and $\Delta$ a threshold such that there exists a $\Delta$-feasible policy. Then for every $\varepsilon \in \mathbb{R}^+$, there exists $n$ such that T-UCT with $n$ MCTS iterations per action selection is $(\Delta + \varepsilon)$-feasible.
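
For orientation, the theorem can be read against the standard CMDP planning problem. The display below is a sketch under assumptions: the symbols $\pi$, $\mathrm{Payoff}$, $\mathrm{Cost}$, and $\pi_n$ (the policy induced by T-UCT with $n$ iterations per action selection) are hypothetical notation rather than the paper's, and $\Delta$-feasibility is assumed to mean that the expected cost is at most $\Delta$, in line with the abstract. The paper's formal definition is not reproduced here.

```latex
% CMDP objective (assumed standard form): maximize payoff subject to a cost bound.
\max_{\pi} \; \mathbb{E}^{\pi}[\mathrm{Payoff}]
\quad \text{subject to} \quad
\mathbb{E}^{\pi}[\mathrm{Cost}] \le \Delta

% Theorem 1 under the same assumed notation: the cost bound holds up to any slack.
\forall \varepsilon > 0 \;\; \exists n : \quad
\mathbb{E}^{\pi_n}[\mathrm{Cost}] \le \Delta + \varepsilon
```

As stated, the theorem concerns only the cost constraint: it bounds the constraint violation by an arbitrarily small $\varepsilon$ and makes no claim about how close the payoff is to optimal.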

Figures (4)

  • Figure 1: The Gridworld and Manhattan environments.
  • Figure 2: The fractions of satisfied configurations in the mean (plain) and the weak (dotted) sense. Upper row: The overall fraction of satisfied instances across varied time limits. Lower row: Breakdown of the fractions of satisfied instances across varied thresholds with the time limit set to the maximum value ($25$, $50$, or $500$, respectively).
  • Figure 3: Mean payoff of T-UCT compared to each baseline. The average is calculated only over instances satisfied (in the weak sense) by both of the considered algorithms.
  • Figure 4: CMDP A

Theorems & Definitions (11)

  • Definition 1
  • Theorem 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • ...and 1 more