Semiparametric Off-Policy Inference for Optimal Policy Values under Possible Non-Uniqueness

Haoyu Wei

TL;DR

The paper tackles off-policy evaluation for the value of optimal policies in offline MDPs, where non-uniqueness of the optimal policy induces non-regularity and invalidates standard root-N inference. It introduces NSAVE, a nonparametric sequential estimator that uses trajectory-level efficient influence functions (EIFs) and online nuisance learning to achieve semiparametric efficiency in the regular regime and to remain stable in degenerate regimes, complemented by a smoothing-based approach and post-selection inference (PSI). Theoretical results establish when the efficient influence function exists, the double robustness and efficiency of NSAVE under uniqueness, and the validity of the smoothing and PSI approaches under broader conditions. Simulations and an application to the OhioT1DM dataset illustrate stable coverage and patient-specific policy-value improvements.

Abstract

Off-policy evaluation (OPE) constructs confidence intervals for the value of a target policy using data generated under a different behavior policy. Most existing inference methods focus on fixed target policies and may fail when the target policy is estimated as optimal, particularly when the optimal policy is non-unique or nearly deterministic. We study inference for the value of optimal policies in Markov decision processes. We characterize the existence of the efficient influence function and show that non-regularity arises under policy non-uniqueness. Motivated by this analysis, we propose a novel Nonparametric SequentiAl Value Evaluation (NSAVE) method, which achieves semiparametric efficiency and retains the double robustness property when the optimal policy is unique, and remains stable in degenerate regimes beyond the scope of existing asymptotic theory. We further develop a smoothing-based approach for valid inference under non-unique optimal policies, and a post-selection procedure with uniform coverage for data-selected optimal policies. Simulation studies support the theoretical results. An application to the OhioT1DM mobile health dataset provides patient-specific confidence intervals for optimal policy values and their improvement over observed treatment policies.
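The non-regularity mentioned in the abstract can be seen in a toy setting far simpler than the paper's MDP framework. The sketch below is an illustration constructed here, not the NSAVE estimator: it assumes a two-armed, one-step problem with unit-variance Gaussian rewards and studies the plug-in estimator "maximum of the arm sample means". The function name `scaled_error_of_max` and all parameter values are hypothetical choices for this example.

```python
import numpy as np


def scaled_error_of_max(mu, n=400, reps=4000, seed=0):
    """Monte Carlo draws of sqrt(n) * (max_a mean_hat_a - max_a mu_a).

    Assumption for this sketch: each arm has n i.i.d. N(mu_a, 1) rewards,
    so each arm's sample mean is exactly N(mu_a, 1/n) and can be drawn directly.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    # One row per replication, one column per arm.
    sample_means = mu + rng.standard_normal((reps, mu.size)) / np.sqrt(n)
    plug_in = sample_means.max(axis=1)  # estimated optimal value
    return np.sqrt(n) * (plug_in - mu.max())


# Tie: both arms are optimal, so the optimal policy is non-unique.
tie = scaled_error_of_max([0.0, 0.0])
# Unique optimum: the second arm is better by a clear margin.
unique = scaled_error_of_max([0.0, 1.0])

print("mean scaled error, tied arms:   ", tie.mean())
print("mean scaled error, unique optimum:", unique.mean())
```

With tied arms the scaled error behaves like the maximum of two independent standard normals, whose mean is $1/\sqrt{\pi} \approx 0.56$ rather than zero, so a normal-based interval centered at the plug-in estimate is miscentered. With a unique optimum the scaled error is centered near zero, which matches the regular regime described above.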

Paper Structure

This paper contains 58 sections, 17 theorems, 249 equations, 4 figures, 1 table.

Key Result

Theorem 3.1

Suppose that Assumptions ass:data_obs, ass:Mark, ass:regularity, and ass:deterministic_policy hold. Then the efficient influence function of $\Psi^*$ exists and satisfies $S^{\text{eff, nonpar}} \{ \Psi^*(P)\}|_{P = P_0} = S^{\text{eff, nonpar}} \{ \Psi(P; \pi)\}|_{P = P_0, \pi = \pi^*(P_0)}.$
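To make the statement concrete, consider a one-stage special case chosen here for exposition (it is not taken from the paper): a single decision with no state, actions $a \in \{0, 1\}$, behavior probabilities $b(a) = P_0(A = a) > 0$, and mean rewards $\mu_a = E_{P_0}[R \mid A = a]$. The symbols $b$, $\mu_a$, $\varphi_\pi$, and $\varphi^*$ are notation assumed for this sketch. The first row gives the value of a fixed policy $\pi$ and its nonparametric efficient influence function; the second row gives the optimal value and, when the maximizer $a^*$ is unique, its efficient influence function.

```latex
\begin{align*}
\Psi(P_0; \pi) &= \sum_{a} \pi(a)\, \mu_a,
  & \varphi_{\pi}(A, R) &= \sum_{a} \pi(a)\, \frac{\mathbf{1}\{A = a\}}{b(a)}\, (R - \mu_a), \\
\Psi^*(P_0) &= \max_{a} \mu_a = \mu_{a^*},
  & \varphi^*(A, R) &= \frac{\mathbf{1}\{A = a^*\}}{b(a^*)}\, (R - \mu_{a^*})
  \quad \text{if } \mu_{a^*} > \mu_{1 - a^*}.
\end{align*}
```

Here $\varphi^*$ is exactly $\varphi_\pi$ evaluated at the policy that puts all its mass on $a^*$, mirroring the identity in the theorem. If $\mu_0 = \mu_1$, the map $(\mu_0, \mu_1) \mapsto \max_a \mu_a$ is only directionally differentiable at the truth, so no influence function of this form exists in this toy case.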

Figures (4)

  • Figure 1: 95% Confidence intervals for the value difference between the estimated optimal policy and the behavior policy for six patients ($\gamma = 0.5$).
  • Figure 2: Log MSE and ECP of value estimates for varying $N$ and $T$ in Scenario A (Ideal Setting).
  • Figure 3: Log MSE and ECP of value estimates for varying $N$ and $T$ in Scenario B (Reward Contamination).
  • Figure 4: Confidence intervals for the value difference between the estimated optimal policy and the behavior policy for six patients, with varying discount factors $\gamma \in \{0.4, 0.7\}$.

Theorems & Definitions (20)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 5.1
  • Corollary 5.2
  • Theorem 5.3
  • Theorem 5.4
  • Corollary 5.5
  • Theorem 5.6
  • Theorem 6.1
  • Corollary 6.2
  • ...and 10 more