Semiparametric Off-Policy Inference for Optimal Policy Values under Possible Non-Uniqueness

Haoyu Wei

TL;DR

The paper tackles off-policy evaluation for the value of optimal policies in offline Markov decision processes (MDPs), where non-uniqueness of the optimal policy induces non-regularity and invalidates standard root-N inference. It introduces NSAVE, a nonparametric sequential estimator that uses trajectory-level efficient influence functions (EIFs) and online nuisance learning to achieve semiparametric efficiency in the regular regime and to remain stable in degenerate regimes. NSAVE is complemented by a smoothing-based approach and a post-selection inference (PSI) procedure. Theoretical results establish when the efficient influence function exists, the double robustness and efficiency of NSAVE when the optimal policy is unique, and the validity of the smoothing and PSI approaches under broader conditions. Simulations and an application to the OhioT1DM dataset illustrate stable coverage and patient-specific improvements in policy value.

Abstract

Off-policy evaluation (OPE) constructs confidence intervals for the value of a target policy using data generated under a different behavior policy. Most existing inference methods focus on fixed target policies and may fail when the target policy is estimated as optimal, particularly when the optimal policy is non-unique or nearly deterministic. We study inference for the value of optimal policies in Markov decision processes. We characterize the existence of the efficient influence function and show that non-regularity arises under policy non-uniqueness. Motivated by this analysis, we propose a novel Nonparametric SequentiAl Value Evaluation (NSAVE) method, which achieves semiparametric efficiency and retains the double robustness property when the optimal policy is unique, and remains stable in degenerate regimes beyond the scope of existing asymptotic theory. We further develop a smoothing-based approach for valid inference under non-unique optimal policies, and a post-selection procedure with uniform coverage for data-selected optimal policies. Simulation studies support the theoretical results. An application to the OhioT1DM mobile health dataset provides patient-specific confidence intervals for optimal policy values and their improvement over observed treatment policies.
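
The non-regularity mentioned in the abstract can be illustrated with a toy one-step problem. The sketch below is not the paper's NSAVE estimator or its MDP setting; it is a minimal simulation, under the assumption of two actions with independent unit-variance Gaussian rewards, of the naive plug-in estimate of the optimal value max(mu_a, mu_b). The function name `scaled_plugin_errors` and all parameter values are hypothetical choices made for illustration. When the two actions are tied (a non-unique optimal policy), the root-n scaled error behaves like the maximum of two independent standard normals, whose mean is 1/sqrt(pi), so it is not centered at zero. When one action is clearly better, the scaled error is approximately a centered standard normal.

```python
import numpy as np

def scaled_plugin_errors(mu_a, mu_b, n=400, reps=20000, seed=0):
    """Simulate sqrt(n) * (max(xbar_a, xbar_b) - max(mu_a, mu_b)).

    Assumption for this toy example: each action's sample mean over n
    independent unit-variance rewards is drawn directly as N(mu, 1/n).
    """
    rng = np.random.default_rng(seed)
    xbar_a = mu_a + rng.standard_normal(reps) / np.sqrt(n)
    xbar_b = mu_b + rng.standard_normal(reps) / np.sqrt(n)
    plugin = np.maximum(xbar_a, xbar_b)  # plug-in estimate of the optimal value
    return np.sqrt(n) * (plugin - max(mu_a, mu_b))

# Tied actions: the optimal policy is non-unique.
tied = scaled_plugin_errors(0.0, 0.0)
# Well-separated actions: the optimal policy is unique.
separated = scaled_plugin_errors(1.0, 0.0)

print("mean scaled error, tied actions:     ", tied.mean())
print("theoretical limit 1/sqrt(pi):        ", 1 / np.sqrt(np.pi))
print("mean scaled error, separated actions:", separated.mean())
```

A normal-based confidence interval centered at the plug-in estimate would therefore be shifted upward in the tied case, which is the kind of failure that motivates the smoothing and post-selection procedures described above.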
