Matching-Based Policy Learning

Xuqiao Li; Ying Yan

TL;DR

This work introduces MB-learning, a matching-based framework for policy learning in observational studies that directly targets the advantage function $A(\pi)$ to identify optimal treatment policies. By imputing counterfactuals via bias-corrected nearest-neighbor matching and recasting policy optimization as a weighted classification problem, MB-learning achieves covariate balance and robustness to propensity-score misspecification while providing non-asymptotic regret guarantees. The authors prove consistency and asymptotic normality for the matching estimators, derive a finite-sample regret bound, and demonstrate competitive finite-sample performance against AIPW-based and outcome-model-based methods through simulations and a real NSW Training Program application. The approach offers interpretable policy trees and is particularly advantageous when sample sizes are moderate or treatment groups are imbalanced. Overall, MB-learning provides a robust, covariate-balanced alternative to weighting-based policy learning with strong theoretical guarantees and practical effectiveness.

Abstract

The beneficial effects of treatments vary across individuals in most studies. Treatment heterogeneity motivates practitioners to search for the optimal policy based on personal characteristics. A long-standing common practice in policy learning has been estimating and maximizing the value function using weighting techniques. Matching is widely used in many applied disciplines to infer causal effects, which is intuitively appealing because the observed covariates are directly balanced across different treatment groups. Nevertheless, matching is rarely explored in policy learning. In this work, we propose a matching-based policy learning framework. We adapt standard and bias-corrected matching methods to estimate an alternative form of the value function: the advantage function, which can be interpreted as the expected improvement achieved by implementing a given policy compared to the equiprobable random policy. We then learn the optimal policy over a restricted policy class by maximizing the matching estimator of the advantage function. We derive a non-asymptotic high probability bound for the regret of the learned optimal policy. Moreover, we show that the learned policy is almost rate-optimal. The competitive finite sample performance of the proposed method compared to weighting-based and outcome modeling-based learning methods is demonstrated in extensive simulation studies and a real data application.
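To make the abstract's idea concrete, the following is a minimal sketch of a standard (not bias-corrected) nearest-neighbor matching estimate of the advantage of a fixed policy. It is an illustration under stated assumptions, not the authors' implementation: the function names `impute_counterfactuals` and `matching_advantage` are hypothetical, distances are plain Euclidean, matching is with replacement, and the advantage is scaled as the sample mean of $\{2\pi(X_i)-1\}\hat{\tau}_i$, which may differ from the paper's normalization by a constant factor that does not change the maximizing policy.

```python
import numpy as np

def impute_counterfactuals(X, W, Y, M=1):
    """Impute each unit's missing potential outcome by averaging the outcomes
    of its M nearest neighbors (Euclidean distance, an assumption) in the
    opposite treatment arm. Matching is with replacement."""
    X = np.asarray(X, dtype=float)
    W = np.asarray(W, dtype=int)
    Y = np.asarray(Y, dtype=float)
    n = len(Y)
    Y_cf = np.empty(n)
    for i in range(n):
        opposite = np.flatnonzero(W != W[i])             # units in the other arm
        dist = np.linalg.norm(X[opposite] - X[i], axis=1)
        nearest = opposite[np.argsort(dist, kind="stable")[:M]]
        Y_cf[i] = Y[nearest].mean()
    return Y_cf

def matching_advantage(policy, X, W, Y, M=1):
    """Sample mean of (2*pi(X_i) - 1) * tau_hat_i, where tau_hat_i is the
    difference between the observed and matched-imputed potential outcomes.
    `policy` is a 0/1 array of recommended treatments, one entry per unit."""
    policy = np.asarray(policy, dtype=int)
    W = np.asarray(W, dtype=int)
    Y = np.asarray(Y, dtype=float)
    Y_cf = impute_counterfactuals(X, W, Y, M)
    Y1 = np.where(W == 1, Y, Y_cf)   # observed if treated, imputed otherwise
    Y0 = np.where(W == 0, Y, Y_cf)   # observed if control, imputed otherwise
    tau_hat = Y1 - Y0
    return float(np.mean((2 * policy - 1) * tau_hat))
```

A policy that treats exactly the units with positive imputed effects obtains the largest value of this estimate, which is why maximizing it over a policy class can be viewed as a weighted classification problem with labels $\mathrm{sign}(\hat{\tau}_i)$ and weights $|\hat{\tau}_i|$.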

Paper Structure (33 sections, 9 theorems, 108 equations, 8 figures, 3 tables)

Introduction
Methodology
Notations and Preliminaries
Matching-Based Advantage Function
Bias-Corrected Matching Estimator of the Advantage Function
Policy Learning with Matching
Implementation Detail
Regret Bound
Simulation Studies
Data Generating Process
Compared Methods and Implementation Details
Simulation Results
Application: Treatment Allocation in the NSW Training Program
Discussion
Technical Proofs
...and 18 more sections

Key Result

Lemma 1

The matching-based advantage function $\hat{A}_{match}(\pi)$ has the following expression: where $K_M(\pi,i)= \sum^n_{j=1}\{2\pi(X_j)-1\}M_{ji}=\sum_{j:M_{ji}=1}\{2\pi(X_j)-1\}$.
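The quantity $K_M(\pi,i)$ can be computed with a single vector-matrix product. The sketch below assumes that $M_{ji}$ is stored as a 0/1 array `match` with `match[j, i] = 1` when unit $i$ is used as a match for unit $j$; that reading of the indices, and the function name `signed_match_counts`, are assumptions made for illustration only.

```python
import numpy as np

def signed_match_counts(policy, match):
    """Return the vector K with K[i] = sum_j (2*policy[j] - 1) * match[j, i].

    policy : 0/1 array of length n, the policy's recommendation for each unit.
    match  : n-by-n 0/1 array; match[j, i] = 1 is assumed to mean that unit i
             serves as a match for unit j.
    """
    policy = np.asarray(policy, dtype=int)
    match = np.asarray(match, dtype=int)
    signs = 2 * policy - 1          # +1 where the policy treats, -1 otherwise
    return signs @ match

# Small example: unit 1 is matched to units 0 and 2; unit 0 is matched to unit 1.
policy = [1, 0, 1]
match = [[0, 1, 0],
         [1, 0, 0],
         [0, 1, 0]]
K = signed_match_counts(policy, match)
```

Under this reading, each unit's count of how often it is used as a match is signed by the policy's recommendation for the units it is matched to, so a unit matched only to units the policy treats contributes a positive $K_M(\pi,i)$.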

Figures (8)

Figure 1: Boxplot of empirical value functions where the main effect model is linear and the contrast function is tree. The global optimal value function is 1.25. The three rows represent the sample sizes $n = 200$, $500$, and $1000$. The five columns represent the different propensity score models presented in the data-generating table of the Supplementary Material.
Figure 2: Boxplot of empirical value functions where the main effect model is linear and the contrast function is non-tree. The global optimal value function is 1.36. The three rows represent the sample sizes $n = 200$, $500$, and $1000$. The five columns represent the different propensity score models presented in the data-generating table of the Supplementary Material.
Figure S1: Boxplot of empirical value functions in different scenarios, where the main effect model is nonlinear and the contrast function is tree. The global optimal value function is 1.77.
Figure S2: Boxplot of empirical value functions in different scenarios, where the main effect model is nonlinear and the contrast function is non-tree. The global optimal value function is 1.87.
Figure S3: Boxplot of empirical value functions in different scenarios, where the main effect model is linear and the contrast function is tree.
...and 3 more figures

Theorems & Definitions (17)

Lemma 1
Lemma 2
Lemma 3
Lemma 4
Theorem 1
Corollary 1
Theorem 2
Theorem 3
Corollary 2
proof
...and 7 more
