Policy Testing in Markov Decision Processes
Kaito Ariu, Po-An Wang, Alexandre Proutiere, Kenshi Abe
TL;DR
This work tackles policy testing in discounted MDPs under fixed confidence, seeking to decide whether $V_p^{\boldsymbol{\rho}}(\boldsymbol{\rho}) > 0$ with as few samples as possible. It identifies an instance-dependent lower bound whose non-convexity arises from the confusing set Alt(p), and overcomes this difficulty by reformulating the problem as policy optimization in a reversed MDP, which allows the dual problem to be solved by projected policy gradient. The resulting PTST algorithm achieves asymptotic instance-optimality in sample complexity and outperforms existing methods in numerical experiments. This framework not only provides the first tractable route to instance-specific optimal pure exploration in MDPs but also suggests a path toward extending the approach to policy evaluation and other pure-exploration tasks.
Abstract
We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem--a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal--matching the instance-specific lower bound on sample complexity--while remaining computationally tractable. We validate our approach with numerical experiments.
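To make the tested quantity concrete, the sketch below computes the discounted value of a fixed policy in a small tabular MDP and compares it with a threshold. It is only an illustration under strong assumptions: the transition kernel is taken as known, whereas the paper studies the case where it must be learned from observations. The function names `policy_value` and `exceeds_threshold`, the array layout, and the toy MDP are hypothetical and are not taken from the paper.

```python
import numpy as np


def policy_value(P, r, pi, rho, gamma):
    """Discounted value of policy `pi` started from distribution `rho`.

    Assumed (hypothetical) array layout:
      P:   transition kernel, shape (S, A, S), P[s, a, t] = p(t | s, a)
      r:   rewards, shape (S, A)
      pi:  stochastic policy, shape (S, A), each row sums to 1
      rho: initial state distribution, shape (S,)
    """
    n_states = P.shape[0]
    # State-to-state kernel and expected one-step reward induced by pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    # Bellman equation V = r_pi + gamma * P_pi @ V, solved as a linear system.
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return float(rho @ V)


def exceeds_threshold(P, r, pi, rho, gamma, threshold=0.0):
    """Return True if the policy value is strictly above `threshold`."""
    return policy_value(P, r, pi, rho, gamma) > threshold


# Toy MDP (an assumption for illustration): two absorbing states,
# reward +1 in state 0 and -1 in state 1, regardless of the action.
P = np.zeros((2, 2, 2))
P[0, :, 0] = 1.0
P[1, :, 1] = 1.0
r = np.array([[1.0, 1.0], [-1.0, -1.0]])
pi = np.full((2, 2), 0.5)
gamma = 0.5

value_mostly_good = policy_value(P, r, pi, np.array([0.75, 0.25]), gamma)
value_mostly_bad = policy_value(P, r, pi, np.array([0.25, 0.75]), gamma)
print(value_mostly_good, value_mostly_bad)
```

With a known kernel the test reduces to one linear solve. The difficulty addressed in the abstract is that the kernel is unknown, so the answer must be inferred from sampled transitions with a prescribed confidence level.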
