Certifiably Robust Policies for Uncertain Parametric Environments

Yannik Schnitzer, Alessandro Abate, David Parker

TL;DR

This work addresses how to obtain provably robust policies when both the distribution over environment parameters and the MDPs induced by those parameters are unknown. It models environments as uncertain parametric MDPs (upMDPs) with an unknown distribution $\mathbb{P}$ over parameters, and combines PAC learning of interval MDPs (IMDPs) with scenario optimisation to produce a single PAC guarantee on a policy's robust performance in unseen environments. The framework supports two learning pathways, robust IMDP policy synthesis and robust meta-reinforcement learning (RoML), and provides theoretical bounds on risk and performance that can be tuned by discarding outlier samples. Implemented as an extension of PRISM, the approach yields tight performance bounds across multiple benchmarks, showing that certifiably robust policies can be learned even under two layers of uncertainty. Overall, the paper delivers a principled method for risk-aware policy synthesis with formal guarantees in uncertain parametric environments.

Abstract

We present a data-driven approach for producing policies that are provably robust across unknown stochastic environments. Existing approaches can learn models of a single environment as an interval Markov decision process (IMDP) and produce a robust policy with a probably approximately correct (PAC) guarantee on its performance. However, these are unable to reason about the impact of environmental parameters underlying the uncertainty. We propose a framework based on parametric Markov decision processes (MDPs) with unknown distributions over parameters. We learn and analyse IMDPs for a set of unknown sample environments induced by parameters. The key challenge is then to produce meaningful performance guarantees that combine the two layers of uncertainty: (1) multiple environments induced by parameters with an unknown distribution; (2) unknown induced environments which are approximated by IMDPs. We present a novel approach based on scenario optimisation that yields a single PAC guarantee quantifying the risk level for which a specified performance level can be assured in unseen environments, plus a means to trade off risk and performance. We implement and evaluate our framework using multiple robust policy generation methods on a range of benchmarks. We show that our approach produces tight bounds on a policy's performance with high confidence.
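
To make "producing a robust policy from an IMDP" concrete, the following is a minimal sketch of robust value iteration, in which an adversary picks the worst transition probabilities inside each interval. It is an illustration under stated assumptions, not the paper's implementation (which extends PRISM): the discounted-reward objective, the dictionary layout, and the names `worst_case_expectation` and `robust_value_iteration` are all hypothetical.

```python
def worst_case_expectation(intervals, values):
    """Minimise sum_s' p(s') * values[s'] over all distributions p that
    respect the given probability intervals.

    intervals: {successor: (lower, upper)}, assumed to admit a distribution.
    """
    # Every successor receives at least its lower bound.
    probs = {s: lo for s, (lo, _hi) in intervals.items()}
    remaining = 1.0 - sum(probs.values())
    # The adversary pushes the remaining mass onto low-value successors first.
    for s in sorted(intervals, key=lambda t: values[t]):
        lo, hi = intervals[s]
        extra = min(hi - lo, remaining)
        probs[s] += extra
        remaining -= extra
    return sum(probs[s] * values[s] for s in probs)


def robust_value_iteration(imdp, rewards, discount=0.9, iterations=200):
    """Assumed layout: imdp = {state: {action: {successor: (lower, upper)}}}.

    Returns worst-case discounted values and a policy maximising them.
    """
    values = {s: 0.0 for s in imdp}
    for _ in range(iterations):
        values = {
            s: rewards[s] + discount * max(
                worst_case_expectation(iv, values) for iv in imdp[s].values()
            )
            for s in imdp
        }
    policy = {
        s: max(imdp[s], key=lambda a: worst_case_expectation(imdp[s][a], values))
        for s in imdp
    }
    return values, policy


# Toy IMDP: "safe" has narrower intervals than "risky".
imdp = {
    "s": {
        "safe": {"good": (0.6, 0.8), "bad": (0.2, 0.4)},
        "risky": {"good": (0.3, 0.9), "bad": (0.1, 0.7)},
    },
    "good": {"stay": {"good": (1.0, 1.0)}},
    "bad": {"stay": {"bad": (1.0, 1.0)}},
}
rewards = {"s": 0.0, "good": 1.0, "bad": 0.0}
values, policy = robust_value_iteration(imdp, rewards)
```

In this toy model the wide interval of the `risky` action lets the adversary drive the probability of reaching `good` down to 0.3, so the robust policy prefers `safe`, whose worst case is 0.6.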

Paper Structure (7 sections, 2 theorems, 10 equations, 4 figures)

This paper contains 7 sections, 2 theorems, 10 equations, 4 figures.

Key Result

Lemma

The true, unknown MDP $\mathcal{M}[\theta_i]$ is contained in its IMDP overapproximation $\mathcal{M}^\gamma[\theta_i]$ with probability at least $1-\gamma$.
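
One standard way to obtain an interval model with this kind of containment guarantee is to build a Hoeffding confidence interval around each empirical transition frequency and split the error budget $\gamma$ across all transitions with a union bound. The sketch below assumes that construction; the summary does not state which concentration bound the paper uses, and the function name `pac_intervals` and its parameters are hypothetical.

```python
import math
from collections import Counter


def pac_intervals(successor_samples, successors, n_transitions_total, gamma):
    """Probability intervals for one state-action pair (assumed construction).

    successor_samples:   observed successor states for this state-action pair
    successors:          known support, i.e. all possible successor states
    n_transitions_total: number of transitions in the whole model that
                         share the error budget gamma (union bound)
    gamma:               overall probability that some interval is wrong
    """
    n = len(successor_samples)
    if n == 0:
        raise ValueError("need at least one sample per state-action pair")
    gamma_per_transition = gamma / n_transitions_total
    # Two-sided Hoeffding: P(|p_hat - p| >= delta) <= 2 * exp(-2 * n * delta^2).
    delta = math.sqrt(math.log(2.0 / gamma_per_transition) / (2.0 * n))
    counts = Counter(successor_samples)
    return {
        s: (max(0.0, counts[s] / n - delta), min(1.0, counts[s] / n + delta))
        for s in successors
    }


samples = ["a"] * 70 + ["b"] * 30
intervals = pac_intervals(samples, ["a", "b"], n_transitions_total=2, gamma=1e-4)
```

The interval half-width shrinks as $1/\sqrt{n}$ in the number of samples per state-action pair, and grows only logarithmically as $\gamma$ is made smaller.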

Figures (4)

  • Figure 1: Example parametric environment with induced performance function.
  • Figure 2: For a fixed policy $\pi$, $J(\pi,\theta)$ is a random variable over performance values with measure $\mathbb{P}$ over valuations $\theta \in \Theta$ (left). We sample performances to bound the risk $r(\pi, \tilde{J})$, i.e., the probability for $J$ to take a value less than $\tilde{J}$ (right).
  • Figure 3: Overview of our framework to derive performance and risk guarantees for policies learned on upMDPs. The setup includes two layers of uncertainty: we sample and analyse unknown environments from an unknown distribution.
  • Figure 4: Example risk bounds obtained from Theorem \ref{thm:bound} (left) and Theorem \ref{thm:bounddiscard} (right) for IMDP confidence $\gamma = 10^{-4}$. For Theorem \ref{thm:bounddiscard}, 5% of samples are discarded.
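
The sampling idea in Figure 2 can be illustrated with a deliberately simplified, single-layer sketch. It assumes the performances $J(\pi,\theta_i)$ of $N$ i.i.d. sampled environments are known exactly and that the guaranteed level $\tilde{J}$ is the minimum sampled performance. Under these assumptions the classical scenario bound gives $r(\pi,\tilde{J}) \le 1 - \eta^{1/N}$ with confidence at least $1-\eta$. This is not the paper's Theorem \ref{thm:bound} or \ref{thm:bounddiscard}, which additionally account for the IMDP confidence $\gamma$ and for discarded samples; the function names and the symbol $\eta$ are hypothetical.

```python
def empirical_risk(performances, threshold):
    """Fraction of sampled environments whose performance is below threshold."""
    return sum(1 for j in performances if j < threshold) / len(performances)


def scenario_risk_bound(n_samples, eta):
    """Smallest eps with (1 - eps)^N <= eta.

    With confidence at least 1 - eta over the N i.i.d. samples, the
    probability that an unseen environment performs worse than the
    minimum sampled performance is at most the returned value.
    """
    return 1.0 - eta ** (1.0 / n_samples)


# Hypothetical sampled performances J(pi, theta_i) for a fixed policy.
performances = [0.9, 0.8, 0.95, 0.7]
j_tilde = min(performances)

# No sample lies strictly below the minimum, so the empirical risk is 0,
# while the certified bound stays positive and tightens as N grows.
risk_on_samples = empirical_risk(performances, j_tilde)
bound_100 = scenario_risk_bound(100, 1e-2)
bound_1000 = scenario_risk_bound(1000, 1e-2)
```

Raising $\tilde{J}$ above the sample minimum amounts to discarding the worst samples, which improves the guaranteed performance at the price of a larger risk bound. That trade-off needs a different bound from the one sketched here.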

Theorems & Definitions (7)

  • Definition: Parametric MDP
  • Definition: Uncertain Parametric MDP
  • Definition: Evaluation Function
  • Definition: Violation Risk
  • Definition: Interval MDP
  • Lemma: DBLP:journals/corr/meggendorfer24
  • Theorem