Percentile Criterion Optimization in Offline Reinforcement Learning

Elita A. Lobo; Cyrus Cousins; Yair Zick; Marek Petrik

TL;DR

This work tackles percentile-criterion optimization in offline/batch RL under model uncertainty. It introduces a Value-at-Risk (VaR) dynamic programming framework via a VaR Bellman operator, which contracts and provides a provable lower bound on the percentile objective without explicit ambiguity-set construction. The authors prove finite-sample and asymptotic performance guarantees, derive a generalized VaR value iteration, and show that the implicit VaR ambiguity sets are asymptotically smaller than Bayesian credible region-based sets. Empirically, the VaR framework yields tighter robustness guarantees and less conservative policies across several domains, with some exceptions at higher confidence levels where alternative robust methods can excel. The approach offers a principled, scalable alternative to Bayesian/credible-set strategies for risk-sensitive policy learning in data-scarce, high-stakes settings.

Abstract

In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the percentile criterion. The percentile criterion is approximately solved by constructing an ambiguity set that contains the true model with high probability and optimizing the policy for the worst model in the set. Since the percentile criterion is non-convex, constructing ambiguity sets is often challenging. Existing work uses Bayesian credible regions as ambiguity sets, but they are often unnecessarily large and result in learning overly conservative policies. To overcome these shortcomings, we propose a novel Value-at-Risk based dynamic programming algorithm to optimize the percentile criterion without explicitly constructing any ambiguity sets. Our theoretical and empirical results show that our algorithm implicitly constructs much smaller ambiguity sets and learns less conservative robust policies.
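To make the idea of a Value-at-Risk based dynamic program concrete, the following is a minimal sketch rather than the authors' algorithm. It assumes the model uncertainty is represented by posterior samples of the transition kernel, and it replaces the expectation in a standard Bellman backup with an empirical lower quantile over those samples. The function names (`var_backup`, `var_value_iteration`), the array layout, the order-statistic convention used for the quantile, and the per-state-action risk level `alpha` are all illustrative assumptions; how the paper sets that level from the overall confidence level is not reproduced here.

```python
import numpy as np


def var_backup(P_samples, r, v, gamma, alpha):
    """One sample-based VaR-style Bellman backup (illustrative sketch).

    P_samples: array (N, S, A, S), N sampled transition models (assumed input).
    r:         array (S, A), known rewards (assumed for simplicity).
    v:         array (S,), current value estimate.
    alpha:     risk level in [0, 1); smaller values are more pessimistic.
    """
    n_models = P_samples.shape[0]
    # q[n, s, a] = r[s, a] + gamma * sum_{s'} P_n[s, a, s'] * v[s']
    q = r[None, :, :] + gamma * (P_samples @ v)
    # Empirical lower quantile over the sampled models, taken separately
    # for every state-action pair (one of several quantile conventions).
    k = min(int(np.floor(alpha * n_models)), n_models - 1)
    q_var = np.sort(q, axis=0)[k]
    # Greedy improvement over actions.
    return q_var.max(axis=1), q_var.argmax(axis=1)


def var_value_iteration(P_samples, r, gamma, alpha, tol=1e-8, max_iter=10_000):
    """Iterate the backup until the sup-norm change falls below tol."""
    v = np.zeros(r.shape[0])
    policy = np.zeros(r.shape[0], dtype=int)
    for _ in range(max_iter):
        v_new, policy = var_backup(P_samples, r, v, gamma, alpha)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, policy
        v = v_new
    return v, policy


# Hypothetical toy problem: 3 states, 2 actions, 200 Dirichlet-sampled models.
rng = np.random.default_rng(0)
P_samples = rng.dirichlet(np.ones(3), size=(200, 3, 2))
r = rng.uniform(size=(3, 2))
v_risk, pi_risk = var_value_iteration(P_samples, r, gamma=0.9, alpha=0.05)
```

No ambiguity set is built anywhere in this sketch: pessimism enters only through the quantile taken at each state-action pair. Because each order statistic is non-expansive in the sup norm, the sampled backup remains a `gamma`-contraction, so the iteration converges for `gamma < 1`.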

Paper Structure (32 sections, 25 theorems, 101 equations, 2 figures, 3 tables, 2 algorithms)

Introduction
Our Contributions
Related Work
Preliminaries
Percentile Criterion
Robust MDPs
VaR Framework
Performance Guarantees
Dynamic Programming Algorithm
Comparison with Bayesian Credible Regions
Experiments
Implementation details:
Experimental Results
Conclusion and Future Work
Additional theoretical results
...and 17 more sections

Key Result

Proposition 3.0

The following properties of $\mathcal{T}_{\mathrm{VaR}_{\alpha}}$ hold for all value functions $\bm u, \bm v\in \mathbb{R}^{\mathcal{S}}$.

Figures (2)

Figure 1: The left panel compares the asymptotic radius of $\Sigma^{-1}$-Minkowski norm BCR ambiguity sets with that of $\mathrm{VaR}_{\alpha}$ ambiguity sets, where $\Sigma$ is the covariance matrix. The size of the BCR ambiguity sets grows significantly with the number of states. The right panels compare the asymptotic forms of the $\mathrm{BCR}_{\alpha}$ and $\mathrm{VaR}_{\alpha}$ ambiguity sets under high and low uncertainty in $\tilde{\bm{P}}$, respectively, at confidence level $\alpha=0.2$. The grey dots represent transition probabilities sampled from a Dirichlet distribution. The sizes of both the $\mathrm{BCR}_{\alpha}$ and $\mathrm{VaR}_{\alpha}$ ambiguity sets increase with the uncertainty in $\tilde{\bm{P}}$; however, the $\mathrm{VaR}_{\alpha}$ ambiguity sets are smaller than the $\mathrm{BCR}_{\alpha}$ ambiguity sets.
Figure 2: Comparison of the test and train robust returns achieved by the VaR, VaRN, BCR$\ell_1$, BCR$\ell_\infty$, WBCR$\ell_1$, WBCR, Soft Robust, Naive Hoeffding, and Opt Hoeffding agents at confidence level $\delta=0.05$ in the Riverswim, Inventory, Population-Small, and Population domains. The VaR framework achieves the highest mean robust returns in most of the domains on both the test and train datasets.

Theorems & Definitions (41)

Example 2.1
Proposition 3.0: Validity
Proposition 3.0: Lower Bound Percentile Criterion
Proposition 3.0
Theorem 3.1: Performance
Theorem 3.2: Asymptotic Performance
Proposition 3.2: Empirical Error Bound
Proposition 3.2: Value Iteration Error
Proposition 3.2: Time Complexity
Proposition 4.0: Equivalence
...and 31 more
