Matching-Based Policy Learning

Xuqiao Li, Ying Yan

TL;DR

This work introduces MB-learning, a matching-based framework for policy learning in observational studies that directly targets the advantage function $A(\pi)$ to identify optimal treatment policies. By imputing counterfactuals via bias-corrected nearest-neighbor matching and recasting policy optimization as a weighted classification problem, MB-learning achieves covariate balance and robustness to propensity-score misspecification while providing non-asymptotic regret guarantees. The authors prove consistency and asymptotic normality for the matching estimators, derive a finite-sample regret bound, and demonstrate competitive finite-sample performance against AIPW-based and outcome-model-based methods through simulations and an application to real data from the NSW Training Program. The approach offers interpretable policy trees and is particularly advantageous when sample sizes are moderate or treatment groups are imbalanced. Overall, MB-learning provides a robust, covariate-balanced alternative to weighting-based policy learning with strong theoretical guarantees and practical effectiveness.
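The two steps named above, imputing counterfactuals by matching and treating policy search as weighted classification, can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Euclidean nearest-neighbor matching without the bias correction, and the function names (`match_impute`, `advantage_hat`, `weighted_error`), the toy data, and the $1/2$ normalization of the advantage are all assumptions made for illustration.

```python
import numpy as np

def match_impute(X, T, Y, n_matches=1):
    """Fill in each unit's unobserved potential outcome with the mean outcome
    of its nearest neighbours (Euclidean distance) in the opposite arm.
    No bias correction is applied in this sketch."""
    X, T, Y = np.asarray(X, float), np.asarray(T, int), np.asarray(Y, float)
    y1 = np.where(T == 1, Y, np.nan)
    y0 = np.where(T == 0, Y, np.nan)
    for i in range(len(Y)):
        pool = np.flatnonzero(T != T[i])              # units in the other arm
        dist = np.linalg.norm(X[pool] - X[i], axis=1)
        nearest = pool[np.argsort(dist, kind="stable")[:n_matches]]
        if T[i] == 1:
            y0[i] = Y[nearest].mean()
        else:
            y1[i] = Y[nearest].mean()
    return y1, y0

def advantage_hat(policy, tau):
    """Plug-in advantage (1/2n) * sum_i {2*pi(X_i) - 1} * tau_i."""
    policy = np.asarray(policy, float)
    return float(np.mean((2 * policy - 1) * tau) / 2)

def weighted_error(policy, tau):
    """Weighted misclassification error: labels 1{tau_i > 0}, weights |tau_i|."""
    labels = (tau > 0).astype(int)
    return float(np.mean(np.abs(tau) * (np.asarray(policy) != labels)))

# Hypothetical toy data: treatment helps only when the first covariate is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
T = rng.integers(0, 2, size=300)
Y = X[:, 1] + T * np.where(X[:, 0] > 0, 1.0, -1.0) + 0.1 * rng.normal(size=300)

y1, y0 = match_impute(X, T, Y)
tau = y1 - y0                        # imputed individual contrasts
policy = (X[:, 0] > 0).astype(int)   # one candidate threshold policy
```

Under these assumptions, `advantage_hat(policy, tau)` equals `np.mean(np.abs(tau)) / 2 - weighted_error(policy, tau)`, so maximizing the estimated advantage over a policy class amounts to minimizing a weighted classification error, with the sign of each imputed contrast as the label and its magnitude as the weight.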

Abstract

The beneficial effects of treatments vary across individuals in most studies. Treatment heterogeneity motivates practitioners to search for the optimal policy based on personal characteristics. A long-standing common practice in policy learning has been estimating and maximizing the value function using weighting techniques. Matching is widely used in many applied disciplines to infer causal effects, which is intuitively appealing because the observed covariates are directly balanced across different treatment groups. Nevertheless, matching is rarely explored in policy learning. In this work, we propose a matching-based policy learning framework. We adapt standard and bias-corrected matching methods to estimate an alternative form of the value function: the advantage function, which can be interpreted as the expected improvement achieved by implementing a given policy compared to the equiprobable random policy. We then learn the optimal policy over a restricted policy class by maximizing the matching estimator of the advantage function. We derive a non-asymptotic high probability bound for the regret of the learned optimal policy. Moreover, we show that the learned policy is almost rate-optimal. The competitive finite sample performance of the proposed method compared to weighting-based and outcome modeling-based learning methods is demonstrated in extensive simulation studies and a real data application.
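The interpretation of the advantage function as the improvement over the equiprobable random policy can be written out in a short derivation. The notation here is an assumption rather than the paper's own: $\pi(X)\in\{0,1\}$ is a deterministic policy, $Y(1)$ and $Y(0)$ are potential outcomes, $V(\pi)$ is the value function, and $\pi_{\mathrm{rand}}$ assigns each treatment with probability $1/2$. The paper's normalization may differ by a constant factor.

```latex
\begin{align*}
V(\pi) &= E\big[\pi(X)\,Y(1) + \{1-\pi(X)\}\,Y(0)\big], \\
V(\pi_{\mathrm{rand}}) &= \tfrac12\,E\big[Y(1) + Y(0)\big], \\
A(\pi) &= V(\pi) - V(\pi_{\mathrm{rand}})
        = \tfrac12\,E\big[\{2\pi(X)-1\}\{Y(1)-Y(0)\}\big].
\end{align*}
```

Because $V(\pi_{\mathrm{rand}})$ does not depend on $\pi$, maximizing $A(\pi)$ over a policy class selects the same policy as maximizing $V(\pi)$.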

Paper Structure

This paper contains 33 sections, 9 theorems, 108 equations, 8 figures, and 3 tables.

Key Result

Lemma 1

The matching-based advantage function $\hat{A}_{match}(\pi)$ has the following expression: where $K_M(\pi,i)= \sum_{j=1}^n\{2\pi(X_j)-1\}M_{ji}=\sum_{j:M_{ji}=1}\{2\pi(X_j)-1\}$.
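The quantity $K_M(\pi,i)$ can be computed directly from a matching matrix, as the sketch below shows. It assumes that $M_{ji}=1$ means unit $i$ is used as a match for unit $j$, and it builds the matrix by Euclidean nearest-neighbor matching in the opposite treatment arm. The function names `matching_matrix` and `k_m` and the four-unit example are hypothetical.

```python
import numpy as np

def matching_matrix(X, T, n_matches=1):
    """Return an n x n 0/1 matrix whose (j, i) entry is 1 when unit i is one of
    the n_matches nearest opposite-arm neighbours of unit j (assumed convention)."""
    X, T = np.asarray(X, float), np.asarray(T, int)
    n = len(T)
    match = np.zeros((n, n), dtype=int)
    for j in range(n):
        pool = np.flatnonzero(T != T[j])              # candidates in the other arm
        dist = np.linalg.norm(X[pool] - X[j], axis=1)
        match[j, pool[np.argsort(dist, kind="stable")[:n_matches]]] = 1
    return match

def k_m(policy, match):
    """K_M(pi, i) = sum_j {2*pi(X_j) - 1} * M_ji, returned for every i at once."""
    signs = 2 * np.asarray(policy, int) - 1           # +1 if treated by pi, else -1
    return signs @ match

# Hypothetical four-unit example with one covariate.
X = [[0.0], [1.0], [2.5], [3.0]]
T = [1, 0, 1, 0]
policy = [1, 0, 1, 1]                                 # pi(X_j) for each unit j
M = matching_matrix(X, T)
K = k_m(policy, M)
```

Read this way, $K_M(\pi,i)$ counts the units matched to $i$ that the policy would treat, minus those it would leave untreated.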

Figures (8)

  • Figure 1: Boxplot of empirical value functions where the main effect model is linear and the contrast function is tree. The global optimal value function is 1.25. The three rows represent different sample sizes $n = 200$, $500$, or $1000$. The five columns represent different propensity score models presented in the data-generating table of the Supplementary Material.
  • Figure 2: Boxplot of empirical value functions where the main effect model is linear and the contrast function is non-tree. The global optimal value function is 1.36. The three rows represent different sample sizes $n = 200$, $500$, or $1000$. The five columns represent different propensity score models presented in the data-generating table of the Supplementary Material.
  • Figure S1: Boxplot of empirical value functions in different scenarios, where the main effect model is nonlinear and the contrast function is tree. The global optimal value function is 1.77.
  • Figure S2: Boxplot of empirical value functions in different scenarios, where the main effect model is nonlinear and the contrast function is non-tree. The global optimal value function is 1.87.
  • Figure S3: Boxplot of empirical value functions in different scenarios, where the main effect model is linear and the contrast function is tree.
  • ...and 3 more figures

Theorems & Definitions (17)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Theorem 3
  • Corollary 2
  • proof
  • ...and 7 more