Policy Testing in Markov Decision Processes

Kaito Ariu, Po-An Wang, Alexandre Proutiere, Kenshi Abe

TL;DR

This work tackles policy testing in discounted MDPs under fixed confidence, seeking to decide whether $V_p^{\pi}(\boldsymbol{\rho})>0$ with as few samples as possible. It identifies an instance-dependent lower bound whose constraints, defined through the confusing set Alt(p), are non-convex, and overcomes this by reformulating the problem as a reversed MDP, enabling a projected policy gradient solution to the dual problem. The resulting PTST algorithm achieves asymptotic instance-optimality in sample complexity and outperforms existing methods in numerical experiments. This framework not only provides the first tractable route to instance-specific optimal pure exploration in MDPs but also offers a path toward extending the approach to policy evaluation and other pure-exploration tasks.

Abstract

We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem--a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal--matching the instance-specific lower bound on sample complexity--while remaining computationally tractable. We validate our approach with numerical experiments.
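To make the quantity being tested concrete, the following is a minimal sketch that evaluates a fixed policy in a fully known discounted MDP and compares its value to a threshold. It is not the paper's sample-based algorithm: the policy tester described above only sees observations, whereas this sketch assumes the transition kernel and rewards are given. The function names (`policy_value`, `value_at`), the array layout, and the toy two-state instance are all illustrative assumptions.

```python
def policy_value(P, r, pi, gamma, tol=1e-12, max_iter=100_000):
    """Evaluate a fixed policy by iterating the Bellman expectation update.

    Assumed layout (hypothetical, for illustration only):
      P[s][a][s2] : probability of moving from s to s2 under action a
      r[s][a]     : expected reward for taking action a in state s
      pi[s][a]    : probability that the policy picks action a in state s
    """
    n_states = len(P)
    n_actions = len(P[0])
    V = [0.0] * n_states
    for _ in range(max_iter):
        V_new = []
        for s in range(n_states):
            total = 0.0
            for a in range(n_actions):
                expected_next = sum(P[s][a][s2] * V[s2] for s2 in range(n_states))
                total += pi[s][a] * (r[s][a] + gamma * expected_next)
            V_new.append(total)
        diff = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if diff < tol:  # the update is a gamma-contraction, so this terminates
            break
    return V


def value_at(rho, V):
    """Value under an initial state distribution rho."""
    return sum(w * v for w, v in zip(rho, V))


# Toy instance (an assumption, not taken from the paper).
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [0.0, 1.0]],  # state 1: absorbing under both actions
]
r = [[1.0, 0.0], [-1.0, -1.0]]
pi = [[1.0, 0.0], [1.0, 0.0]]  # always play action 0
gamma = 0.5
rho = [0.75, 0.25]

V = policy_value(P, r, pi, gamma)
threshold = 0.0
answer = value_at(rho, V) > threshold
```

In the fixed-confidence setting the model is unknown, so the same yes/no answer has to be returned from samples with error probability at most the prescribed confidence level; the sketch only shows what the answer is once the model is known.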

Paper Structure

This paper contains 44 sections, 33 theorems, 126 equations, 2 figures, 3 tables, 2 algorithms.

Key Result

Lemma 1

Under Assumption apt:general, $\min_{q\in \mathcal{P}}\max_{s,s'\in\mathcal{S}}\bigl(V^\pi_q(s)-V^\pi_q(s')\bigr)>0.$
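The inner maximum in Lemma 1 is the spread of the value function over states: the largest pairwise difference equals the maximum value minus the minimum value. A minimal sketch for a single value vector follows; the helper name `value_spread` and the sample numbers are assumptions for illustration, and the outer minimum over $q\in\mathcal{P}$ is not computed here.

```python
def value_spread(V):
    """Largest pairwise gap max_{s,s'} (V[s] - V[s']), computed as max(V) - min(V)."""
    return max(V) - min(V)


# Hypothetical value vector over three states.
V_example = [2.0, -2.0, 0.5]

# Brute-force check over all ordered pairs of states.
pairwise = max(v1 - v2 for v1 in V_example for v2 in V_example)
spread = value_spread(V_example)
```

A strictly positive spread means the value function is not constant across states, which is what the lemma asserts for every model in $\mathcal{P}$.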

Figures (2)

  • Figure 1: From the initial MDP (left) to the reversed MDP (right): In the reversed MDP, variables are shown in red; their initial MDP counterparts are shown in black.
  • Figure 2: Comparison of average stopping times and delta for the proposed algorithm and KLB-TS. The left, center, and right panels correspond to $|\mathcal{S}| = |\mathcal{A}| = 2, 3, 5$, respectively. Results are averaged over 30 instances. Error bars indicate the standard error of the mean.

Theorems & Definitions (61)

  • Definition 1
  • Lemma 1
  • Theorem 1
  • Proposition 1
  • Theorem 2
  • Proposition 2
  • Lemma 2: Simulation / performance difference lemma
  • Lemma 3: Policy gradient
  • Lemma 4: Smoothness
  • Theorem 3
  • ...and 51 more