Percentile Criterion Optimization in Offline Reinforcement Learning
Elita A. Lobo, Cyrus Cousins, Yair Zick, Marek Petrik
TL;DR
This work tackles percentile-criterion optimization in offline (batch) RL under model uncertainty. It introduces a Value-at-Risk (VaR) dynamic programming framework built on a VaR Bellman operator, which is a contraction and yields a provable lower bound on the percentile objective without explicitly constructing an ambiguity set. The authors prove finite-sample and asymptotic performance guarantees, derive a generalized VaR value iteration algorithm, and show that the implicit VaR ambiguity sets are asymptotically smaller than sets built from Bayesian credible regions. Empirically, the VaR framework yields tighter robustness guarantees and less conservative policies across several domains, with some exceptions at higher confidence levels, where alternative robust methods can perform better. The approach offers a principled, scalable alternative to Bayesian credible-set strategies for risk-sensitive policy learning in data-scarce, high-stakes settings.
Abstract
In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the percentile criterion. The percentile criterion is approximately solved by constructing an ambiguity set that contains the true model with high probability and optimizing the policy for the worst model in the set. Since the percentile criterion is non-convex, constructing ambiguity sets is often challenging. Existing work uses Bayesian credible regions as ambiguity sets, but they are often unnecessarily large and result in learning overly conservative policies. To overcome these shortcomings, we propose a novel Value-at-Risk based dynamic programming algorithm to optimize the percentile criterion without explicitly constructing any ambiguity sets. Our theoretical and empirical results show that our algorithm implicitly constructs much smaller ambiguity sets and learns less conservative robust policies.
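To make the idea of a VaR-based dynamic program concrete, the following is a minimal sketch, not the authors' implementation. It assumes a tabular MDP with known rewards, model uncertainty represented by a finite set of sampled transition kernels (for example, posterior samples), and a VaR update that takes the empirical lower alpha-quantile of the one-step return across those samples before maximizing over actions. The names `var_value_iteration`, `P_samples`, and `alpha` are hypothetical, and the exact operator and confidence-level convention in the paper may differ.

```python
import numpy as np


def var_value_iteration(P_samples, R, gamma=0.9, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Illustrative VaR-style value iteration (an assumed form, not the paper's code).

    P_samples: array of shape (N, S, A, S), N sampled transition kernels.
    R:         array of shape (S, A), known rewards.
    alpha:     quantile level in (0, 1]; smaller values are more pessimistic.
    """
    N = P_samples.shape[0]
    S = R.shape[0]
    # Index of the empirical lower alpha-quantile among N sorted samples.
    k = max(int(np.ceil(alpha * N)) - 1, 0)
    v = np.zeros(S)
    for _ in range(max_iter):
        # One-step return of every (s, a) under every sampled model: (N, S, A).
        q_samples = R[None, :, :] + gamma * (P_samples @ v)
        # Empirical VaR across models, taken separately for each (s, a): (S, A).
        q_var = np.sort(q_samples, axis=0)[k]
        v_new = q_var.max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    policy = q_var.argmax(axis=1)
    return v, policy


# Usage on a small random problem with Dirichlet-sampled kernels.
rng = np.random.default_rng(0)
P_demo = rng.dirichlet(np.ones(3), size=(20, 3, 2))  # shape (20, 3, 2, 3)
R_demo = rng.random((3, 2))
v_demo, pi_demo = var_value_iteration(P_demo, R_demo, alpha=0.1)
```

Because an empirical quantile is monotone and shifts by a constant when its argument does, this update is a gamma-contraction in the sup norm, so the iteration converges. Note that no ambiguity set is ever built: the pessimism comes entirely from the per-state-action quantile.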
