Percentile Criterion Optimization in Offline Reinforcement Learning

Elita A. Lobo, Cyrus Cousins, Yair Zick, Marek Petrik

TL;DR

This work tackles percentile-criterion optimization in offline/batch RL under model uncertainty. It introduces a Value-at-Risk (VaR) dynamic programming framework via a VaR Bellman operator, which contracts and provides a provable lower bound on the percentile objective without explicit ambiguity-set construction. The authors prove finite-sample and asymptotic performance guarantees, derive a generalized VaR value iteration, and show that the implicit VaR ambiguity sets are asymptotically smaller than Bayesian credible region-based sets. Empirically, the VaR framework yields tighter robustness guarantees and less conservative policies across several domains, with some exceptions at higher confidence levels where alternative robust methods can excel. The approach offers a principled, scalable alternative to Bayesian/credible-set strategies for risk-sensitive policy learning in data-scarce, high-stakes settings.

Abstract

In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the percentile criterion. The percentile criterion is approximately solved by constructing an ambiguity set that contains the true model with high probability and optimizing the policy for the worst model in the set. Since the percentile criterion is non-convex, constructing ambiguity sets is often challenging. Existing work uses Bayesian credible regions as ambiguity sets, but they are often unnecessarily large and result in learning overly conservative policies. To overcome these shortcomings, we propose a novel Value-at-Risk based dynamic programming algorithm to optimize the percentile criterion without explicitly constructing any ambiguity sets. Our theoretical and empirical results show that our algorithm implicitly constructs much smaller ambiguity sets and learns less conservative robust policies.
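
To make the idea of a Value-at-Risk based dynamic program concrete, here is a minimal sketch, not the authors' implementation. It assumes a tabular MDP, known rewards, and a set of transition kernels sampled from a posterior. It replaces the expectation in the usual Bellman backup with an empirical lower quantile taken over the sampled models. The names `var_value_iteration`, `P_samples`, `R`, and `delta` are hypothetical, and the quantile convention (lower `delta`-quantile of the sampled backups) is an assumption. The paper's guarantees may depend on choices of confidence level that this sketch does not reproduce.

```python
import numpy as np


def _var_backup(P_samples, R, v, gamma, delta):
    """Empirical lower delta-quantile of sampled Bellman backups, shape (S, A)."""
    # q_samples[n, s, a] = R[s, a] + gamma * sum_s' P_samples[n, s, a, s'] * v[s']
    q_samples = R[None, :, :] + gamma * (P_samples @ v)
    # Quantile across the sampled models (axis 0); this is an assumed convention.
    return np.quantile(q_samples, delta, axis=0)


def var_value_iteration(P_samples, R, gamma=0.9, delta=0.05,
                        tol=1e-8, max_iter=10_000):
    """Quantile-based value iteration over sampled transition models.

    P_samples: array of shape (N, S, A, S), N sampled transition kernels.
    R:         array of shape (S, A), rewards (assumed known).
    Returns the value estimate v (shape (S,)) and a greedy policy (shape (S,)).
    """
    n_states = P_samples.shape[1]
    v = np.zeros(n_states)
    for _ in range(max_iter):
        v_new = _var_backup(P_samples, R, v, gamma, delta).max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    policy = _var_backup(P_samples, R, v, gamma, delta).argmax(axis=1)
    return v, policy


# Usage: 200 Dirichlet-sampled kernels for a 3-state, 2-action MDP.
rng = np.random.default_rng(0)
P_demo = rng.dirichlet(np.ones(3), size=(200, 3, 2))  # shape (200, 3, 2, 3)
R_demo = rng.uniform(0.0, 1.0, size=(3, 2))
v_demo, pi_demo = var_value_iteration(P_demo, R_demo, gamma=0.9, delta=0.05)
```

Because the quantile is taken per state-action pair, no ambiguity set is ever built explicitly. In this sketch, a smaller `delta` gives a more pessimistic value estimate.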

Paper Structure (32 sections, 25 theorems, 101 equations, 2 figures, 3 tables, 2 algorithms)

Key Result

Proposition 3.0

The following properties of $\mathcal{T}_{\mathrm{VaR}_{\alpha}}$ hold for all value functions $\bm u, \bm v\in \mathbb{R}^{\mathcal{S}}$.

Figures (2)

  • Figure 1: (Left) Asymptotic radius of the $\Sigma^{-1}$-Minkowski norm BCR ambiguity sets compared with that of the $\mathrm{VaR}_{\alpha}$ ambiguity sets, where $\Sigma$ is the covariance matrix. The size of the BCR ambiguity sets grows significantly with the number of states. (Right) Asymptotic forms of the $\mathrm{BCR}_{\alpha}$ and $\mathrm{VaR}_{\alpha}$ ambiguity sets under high and low uncertainty in $\tilde{\bm{P}}$, respectively, at confidence level $\alpha=0.2$. The grey dots represent transition probabilities sampled from a Dirichlet distribution. Both the $\mathrm{BCR}_{\alpha}$ and $\mathrm{VaR}_{\alpha}$ ambiguity sets grow as the uncertainty in $\tilde{\bm{P}}$ increases, but the $\mathrm{VaR}_{\alpha}$ ambiguity sets are smaller than the $\mathrm{BCR}_{\alpha}$ ambiguity sets.
  • Figure 2: Comparison of test and train robust returns achieved by the VaR, VaRN, BCR$\ell_1$, BCR$\ell_\infty$, WBCR$\ell_1$, WBCR, Soft Robust, Naive Hoeffding, and Opt Hoeffding agents at confidence level $\delta=0.05$ in the Riverswim, Inventory, Population-Small, and Population domains. The VaR framework achieves the highest mean robust returns in most of the domains on both the test and train datasets.

Theorems & Definitions (41)

  • Example 2.1
  • Proposition 3.0: Validity
  • Proposition 3.0: Lower Bound Percentile Criterion
  • Proposition 3.0
  • Theorem 3.1: Performance
  • Theorem 3.2: Asymptotic Performance
  • Proposition 3.2: Empirical Error Bound
  • Proposition 3.2: Value Iteration Error
  • Proposition 3.2: Time Complexity
  • Proposition 4.0: Equivalence
  • ...and 31 more