A Two-Stage Interpretable Matching Framework for Causal Inference
Sahil Shikalgar, Md. Noor-E-Alam
TL;DR
The paper tackles causal inference from observational data by proposing TIM, a Two-stage Interpretable Matching framework that first enforces exact matching across all covariates and then iteratively removes the least important confounders to refine matches within strata. Within each stratum, a mixed distance metric combines Euclidean distances for continuous covariates with a distribution-aware discrete distance, enabling high-dimensional, mixed-type matching that preserves sample size. CATE is estimated per stratum using inverse-scores to weight control units by closeness when full matches are not possible, and then averaged across strata for the overall effect. Through synthetic simulations and a CDC BRFSS case study on high cholesterol and diabetes, TIM demonstrates improved multivariate overlap and robust CATE estimation, offering a scalable, interpretable tool for causal inference in observational healthcare data while acknowledging limitations such as binary treatment and cross-sectional data.
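As a rough illustration of the within-stratum step described above, the sketch below combines a Euclidean distance on continuous covariates with a discrete distance, then weights control units by inverse distance when estimating a stratum-level CATE. The function names, the simple mismatch count standing in for the distribution-aware discrete distance, and the `eps` smoothing term are all assumptions made for illustration; they are not the paper's actual definitions.

```python
import math

def mixed_distance(u, v, continuous, discrete):
    """Hypothetical mixed distance between two units (dicts of covariates)."""
    # Euclidean distance over the continuous covariates.
    cont = math.sqrt(sum((u[c] - v[c]) ** 2 for c in continuous))
    # Assumption: a plain mismatch count stands in for the paper's
    # distribution-aware discrete distance, which is not specified here.
    disc = sum(1.0 for c in discrete if u[c] != v[c])
    return cont + disc

def stratum_cate(treated_outcomes, control_outcomes, control_distances, eps=1e-9):
    """Hypothetical stratum-level CATE with inverse-distance control weights."""
    # Closer controls (smaller distance) receive larger weights.
    weights = [1.0 / (d + eps) for d in control_distances]
    total = sum(weights)
    control_mean = sum(w * y for w, y in zip(weights, control_outcomes)) / total
    treated_mean = sum(treated_outcomes) / len(treated_outcomes)
    return treated_mean - control_mean
```

Under this weighting, a control unit far from the treated unit(s) contributes little to the stratum estimate, which is one plausible reading of "weighting by closeness".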
Abstract
Matching in causal inference from observational data aims to construct treatment and control groups with similar covariate distributions, thereby reducing confounding and supporting unbiased estimation of treatment effects. The matched sample closely mimics a randomized controlled trial (RCT), thus improving the quality of causal estimates. We introduce a novel Two-stage Interpretable Matching (TIM) framework for transparent and interpretable covariate matching. In the first stage, we perform exact matching across all available covariates. Treatment and control units without an exact match in the first stage proceed to the second stage. Here, we iteratively refine the matching process by removing the least significant confounder in each iteration and attempting exact matching on the remaining covariates. We learn a distance metric for the dropped covariates to quantify closeness to the treatment unit(s) within the corresponding strata. We use these high-quality matches to estimate the conditional average treatment effects (CATEs). To validate TIM, we conducted experiments on synthetic datasets with varying association structures and correlations. We assessed its performance by measuring bias in CATE estimation and evaluating multivariate overlap between treatment and control groups before and after matching. Additionally, we apply TIM to a real-world healthcare dataset from the Centers for Disease Control and Prevention (CDC) to estimate the causal effect of high cholesterol on diabetes. Our results demonstrate that TIM improves CATE estimates, increases multivariate overlap, and scales effectively to high-dimensional data, making it a robust tool for causal inference in observational data.
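The two-stage procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes discrete covariates, a binary treatment flag stored under a hypothetical key `t`, and a covariate importance ranking that is supplied as input (how TIM ranks confounders is not specified in this section). The first pass of the loop corresponds to exact matching on all covariates; each later pass drops the least important remaining covariate and retries exact matching on the unmatched units.

```python
from collections import defaultdict

def two_stage_match(units, covariates_by_importance, treatment_key="t"):
    """Return (strata, unmatched) for a list of unit dicts.

    covariates_by_importance lists covariate names, most important first
    (hypothetical input; the ranking method is assumed to be given).
    Each stratum is a (matched_covariates, members) pair.
    """
    remaining = list(units)
    keep = list(covariates_by_importance)
    strata = []
    while remaining and keep:
        # Group the still-unmatched units by exact values of the kept covariates.
        groups = defaultdict(list)
        for u in remaining:
            groups[tuple(u[c] for c in keep)].append(u)
        remaining = []
        for members in groups.values():
            arms = {u[treatment_key] for u in members}
            if arms == {0, 1}:
                # The stratum holds both treated and control units: matched.
                strata.append((tuple(keep), members))
            else:
                remaining.extend(members)
        keep.pop()  # drop the least important remaining covariate
    return strata, remaining

# Toy example with two covariates, x1 ranked above x2.
units = [
    {"x1": 1, "x2": 0, "t": 1},
    {"x1": 1, "x2": 0, "t": 0},
    {"x1": 0, "x2": 1, "t": 1},
    {"x1": 0, "x2": 0, "t": 0},
    {"x1": 2, "x2": 5, "t": 1},
]
strata, leftover = two_stage_match(units, ["x1", "x2"])
```

In the toy data, the first two units match exactly on both covariates, the next two match only after `x2` is dropped, and the last unit has no control with the same `x1` and stays unmatched. Recording which covariates each stratum was matched on keeps the result interpretable, since the dropped covariates are exactly those a learned distance would then need to cover.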
