Certifiably Robust Policies for Uncertain Parametric Environments
Yannik Schnitzer, Alessandro Abate, David Parker
TL;DR
This work addresses the synthesis of provably robust policies when both the distribution over environment parameters and the MDPs induced by those parameters are unknown. It models environments as uncertain parametric MDPs (upMDPs) with an unknown distribution $\mathbb{P}$ over parameters, and combines PAC learning of interval MDPs (IMDPs) with scenario optimisation to produce a single PAC guarantee on a policy's robust performance in unseen environments. The framework supports two learning pathways, robust IMDP policy synthesis and robust meta-reinforcement learning (RoML), and provides theoretical bounds on risk and performance that can be tuned by discarding outlier samples. Implemented as an extension of PRISM, the approach yields tight performance bounds across multiple benchmarks, showing that certifiably robust policies can be learned even under two layers of uncertainty. Overall, the paper delivers a principled, scalable method for risk-aware policy synthesis with formal guarantees in uncertain parametric environments.
Abstract
We present a data-driven approach for producing policies that are provably robust across unknown stochastic environments. Existing approaches can learn a model of a single environment as an interval Markov decision process (IMDP) and produce a robust policy with a probably approximately correct (PAC) guarantee on its performance. However, these approaches are unable to reason about the impact of environmental parameters underlying the uncertainty. We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters. We learn and analyse IMDPs for a set of unknown sample environments induced by parameters. The key challenge is then to produce meaningful performance guarantees that combine the two layers of uncertainty: (1) multiple environments induced by parameters with an unknown distribution; (2) unknown induced environments which are approximated by IMDPs. We present a novel approach based on scenario optimisation that yields a single PAC guarantee quantifying the risk level for which a specified performance level can be assured in unseen environments, plus a means to trade off risk and performance. We implement and evaluate our framework using multiple robust policy generation methods on a range of benchmarks. We show that our approach produces tight bounds on a policy's performance with high confidence.
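
To make the risk/performance trade-off concrete, the following Python sketch shows how a generic scenario-optimisation bound with sample discarding behaves. It is an illustration under stated assumptions, not the paper's theorem: it uses the standard sampling-and-discarding bound for a single decision variable, in which, after N i.i.d. sampled environments with the k worst discarded, the risk ε satisfies P[Bin(N, ε) ≤ k] ≤ β with confidence 1 − β. It also assumes that higher performance is better and that each entry of `lower_bounds` is a sound per-environment lower bound, and it ignores the confidence with which each learned IMDP contains its true environment, which the combined guarantee described above must also account for. The names `scenario_risk`, `robust_guarantee` and `lower_bounds` are hypothetical.

```python
from math import comb


def binom_cdf(k: int, n: int, eps: float) -> float:
    """Return P[Bin(n, eps) <= k], computed by direct summation."""
    return sum(comb(n, i) * eps**i * (1.0 - eps) ** (n - i) for i in range(k + 1))


def scenario_risk(n_samples: int, n_discarded: int, beta: float) -> float:
    """Risk level eps solving P[Bin(N, eps) <= k] = beta (generic scenario bound).

    Assumption: one decision variable (the guaranteed performance level).
    Direct summation is fine for moderate N; very large N would need a
    numerically safer binomial CDF.
    """
    if not 0 <= n_discarded < n_samples:
        raise ValueError("need 0 <= n_discarded < n_samples")
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    # The CDF decreases monotonically in eps, so bisection finds the crossing.
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if binom_cdf(n_discarded, n_samples, mid) <= beta:
            hi = mid
        else:
            lo = mid
    return hi


def robust_guarantee(lower_bounds, n_discarded, beta):
    """Return (performance level, risk) after discarding the k worst samples."""
    ordered = sorted(lower_bounds)
    level = ordered[n_discarded]  # worst remaining value once k lowest are dropped
    return level, scenario_risk(len(ordered), n_discarded, beta)
```

With no discarding the bound reduces to ε = 1 − β^(1/N), so 100 sampled environments at β = 0.01 give a risk of about 0.045. Discarding more samples raises the guaranteed performance level and the risk together, which is the trade-off the abstract refers to.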
