ConstrainedZero: Chance-Constrained POMDP Planning using Learned Probabilistic Failure Surrogates and Adaptive Safety Constraints

Robert J. Moss, Arec Jamgochian, Johannes Fischer, Anthony Corso, Mykel J. Kochenderfer

TL;DR

ConstrainedZero tackles safe planning under uncertainty by formulating the problem as a CC-POMDP and extending BetaZero with a dedicated failure-probability head. It combines offline-trained neural surrogates for value, policy, and failure probability with online $\Delta$-MCTS, which uses adaptive conformal inference to tune the failure threshold $\Delta(b)$ during planning and a CC-PUCT criterion to select high-reward actions that satisfy the safety constraint. The method explicitly enforces a target safety level $\Delta_0$, loosening or tightening the threshold adaptively, and reports improved safety adherence and returns on the LightDark, CAS (aircraft collision avoidance), and CCS (CO$_2$ storage) benchmarks. Separating the safety constraint from the objective lets the planner reach the target level of safety without tuning a trade-off between rewards and costs.
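
The CC-PUCT idea mentioned above (pursue reward only among actions that satisfy the safety threshold) can be illustrated with a minimal sketch. This is not the paper's exact criterion: the function name `select_action`, the standard PUCT bonus, and the fallback to the least-risky action when no action is feasible are all assumptions made for illustration.

```python
import math

def select_action(q, prior, visits, fail, delta, c=1.0):
    """Hypothetical chance-constrained PUCT selection.

    q[a]      : estimated value of action a
    prior[a]  : policy-network prior for action a
    visits[a] : visit count of action a
    fail[a]   : estimated failure probability of action a
    delta     : current failure threshold
    """
    actions = range(len(q))
    total = sum(visits)

    # Keep only actions whose estimated failure probability meets the threshold.
    feasible = [a for a in actions if fail[a] <= delta]

    # Assumption: if nothing is feasible, fall back to the least-risky action.
    if not feasible:
        return min(actions, key=lambda a: fail[a])

    def puct(a):
        # Standard PUCT score: value plus prior-weighted exploration bonus.
        return q[a] + c * prior[a] * math.sqrt(total) / (1 + visits[a])

    return max(feasible, key=puct)
```

With a loose threshold the highest-value action wins; as the threshold tightens, riskier actions are excluded even when their value estimates are higher.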

Abstract

To plan safely in uncertain environments, agents must balance utility with safety constraints. Safe planning problems can be modeled as a chance-constrained partially observable Markov decision process (CC-POMDP) and solutions often use expensive rollouts or heuristics to estimate the optimal value and action-selection policy. This work introduces the ConstrainedZero policy iteration algorithm that solves CC-POMDPs in belief space by learning neural network approximations of the optimal value and policy with an additional network head that estimates the failure probability given a belief. This failure probability guides safe action selection during online Monte Carlo tree search (MCTS). To avoid overemphasizing search based on the failure estimates, we introduce $\Delta$-MCTS, which uses adaptive conformal inference to update the failure threshold during planning. The approach is tested on a safety-critical POMDP benchmark, an aircraft collision avoidance system, and the sustainability problem of safe CO$_2$ storage. Results show that by separating safety constraints from the objective we can achieve a target level of safety without optimizing the balance between rewards and costs.
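
To make the adaptive conformal inference step concrete, here is a minimal sketch of the generic ACI update of Gibbs and Candès, $\alpha_{t+1} = \alpha_t + \gamma(\alpha - \mathrm{err}_t)$, where $\mathrm{err}_t \in \{0, 1\}$ marks a miss at step $t$. How ConstrainedZero defines the error indicator and maps this rule onto the failure threshold is not stated in this summary, so the function names and the default step size below are hypothetical.

```python
def aci_update(level, target, err, step=0.05):
    """One generic ACI step: level + step * (target - err).

    level  : current working level (alpha_t)
    target : desired long-run error rate (alpha)
    err    : 1 if the last step was a miss, else 0
    step   : step size gamma (hypothetical default)
    """
    return level + step * (target - err)

def run_aci(errors, target, step=0.05):
    """Apply the update over a sequence of 0/1 error indicators.

    Starts at the target level and returns the full history of levels.
    """
    level = target
    history = [level]
    for err in errors:
        level = aci_update(level, target, err, step)
        history.append(level)
    return history
```

A miss (`err = 1`) lowers the working level, making the next step more conservative, while a run of hits raises it gradually.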

Paper Structure

This paper contains 21 sections, 21 equations, 10 figures, and 3 algorithms.

Figures (10)

  • Figure 1: Elements of ConstrainedZero for CC-POMDP planning.
  • Figure 2: ConstrainedZero online Monte Carlo tree search with failure threshold adaptation ($\Delta$-MCTS).
  • Figure 3: BetaZero($\lambda$) comparison.
  • Figure 4: ConstrainedZero results. Bold indicates the best results within the $\Delta_0$ threshold.
  • Figure 5: Results for the collision avoidance CC-POMDP. The trajectories panel matches the "notch" behavior from Kochenderfer et al. (2012).
  • ...and 5 more figures