Matching-Based Policy Learning
Xuqiao Li, Ying Yan
TL;DR
This work introduces MB-learning, a matching-based framework for policy learning in observational studies that directly targets the advantage function $A(\pi)$ to identify optimal treatment policies. By imputing counterfactuals via bias-corrected nearest-neighbor matching and recasting policy optimization as a weighted classification problem, MB-learning balances covariates directly and is robust to propensity-score misspecification. The authors prove consistency and asymptotic normality for the matching estimators, derive a non-asymptotic regret bound, and demonstrate competitive finite-sample performance against AIPW-based and outcome-model-based methods in simulations and in an application to the NSW Training Program data. The approach yields interpretable policy trees and is particularly advantageous when sample sizes are moderate or treatment groups are imbalanced. Overall, MB-learning offers a covariate-balanced alternative to weighting-based policy learning that combines theoretical guarantees with practical effectiveness.
Abstract
The beneficial effects of treatments vary across individuals in most studies. This treatment heterogeneity motivates practitioners to search for the optimal policy based on personal characteristics. A long-standing practice in policy learning has been to estimate and maximize the value function using weighting techniques. Matching is widely used in many applied disciplines to infer causal effects; it is intuitively appealing because the observed covariates are directly balanced across treatment groups. Nevertheless, matching has rarely been explored in policy learning. In this work, we propose a matching-based policy learning framework. We adapt standard and bias-corrected matching methods to estimate an alternative form of the value function: the advantage function, which can be interpreted as the expected improvement achieved by implementing a given policy compared to the equiprobable random policy. We then learn the optimal policy over a restricted policy class by maximizing the matching estimator of the advantage function. We derive a non-asymptotic high-probability bound for the regret of the learned optimal policy. Moreover, we show that the learned policy is almost rate-optimal. The competitive finite-sample performance of the proposed method compared to weighting-based and outcome-modeling-based learning methods is demonstrated in extensive simulation studies and a real data application.
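To make the advantage function concrete, the following is a minimal sketch and not the authors' implementation. It assumes a binary treatment coded 0/1, in which case the improvement over the equiprobable random policy reduces to $A(\pi) = \tfrac{1}{2} E[(2\pi(X) - 1)(Y(1) - Y(0))]$. It uses plain nearest-neighbor matching with Euclidean distance and omits the bias correction. The names `impute_counterfactuals`, `advantage_estimate`, and the parameter `M` are hypothetical, and the toy data are made up for illustration.

```python
import numpy as np

def impute_counterfactuals(X, T, Y, M=1):
    """Impute each unit's missing potential outcome as the mean observed
    outcome of its M nearest neighbors in the opposite treatment arm.
    Assumption: Euclidean distance on raw covariates, no bias correction."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(T, dtype=int)
    Y = np.asarray(Y, dtype=float)
    if not (np.any(T == 0) and np.any(T == 1)):
        raise ValueError("both treatment arms must be non-empty")
    # Observed outcomes fill one potential outcome per unit.
    Y0 = np.where(T == 0, Y, np.nan)
    Y1 = np.where(T == 1, Y, np.nan)
    for i in range(len(Y)):
        opposite = np.flatnonzero(T != T[i])
        dist = np.linalg.norm(X[opposite] - X[i], axis=1)
        matches = opposite[np.argsort(dist, kind="stable")[:M]]
        if T[i] == 1:
            Y0[i] = Y[matches].mean()
        else:
            Y1[i] = Y[matches].mean()
    return Y0, Y1

def advantage_estimate(policy, X, Y0, Y1):
    """Plug-in estimate of A(pi) for a policy mapping covariates to {0, 1}."""
    pi = np.asarray(policy(np.asarray(X, dtype=float)))
    return float(np.mean((2 * pi - 1) * (Y1 - Y0)) / 2)

# Toy data (made up): treatment helps when x is small, hurts when x is large.
X = np.array([[0.0], [0.1], [1.0], [1.1]])
T = np.array([0, 1, 0, 1])
Y = np.array([1.0, 2.0, 3.0, 1.0])

Y0, Y1 = impute_counterfactuals(X, T, Y, M=1)
treat_small_x = lambda X: (X[:, 0] < 0.5).astype(int)
treat_all = lambda X: np.ones(len(X), dtype=int)

a_small = advantage_estimate(treat_small_x, X, Y0, Y1)
a_all = advantage_estimate(treat_all, X, Y0, Y1)
print(a_small, a_all)
```

Under this form, the sign of each imputed effect `Y1 - Y0` acts as a class label and its magnitude as a weight, which is why maximizing the estimate over a policy class can be treated as a weighted classification problem.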
