Optimizing Feature Selection in Causal Inference: A Three-Stage Computational Framework for Unbiased Estimation
Tianyu Yang, Md. Noor-E-Alam
TL;DR
The paper tackles unbiased causal effect estimation from observational data by rethinking feature selection at the design stage: retain confounders $X_C$ and pure outcome predictors $X_P$, and exclude pure treatment predictors $X_T$ and noise variables $X_S$. It proposes a three-stage framework in which an SVM-based exposure model (Stage 1) is followed by two adaptive elastic-net outcome-model steps (Stages 2 and 3), with sigmoid or tanh penalty smoothing guiding variable selection. The authors report that this design improves oracle properties and reduces tuning sensitivity relative to traditional two-stage methods. Across extensive synthetic benchmarks and a large NSDUH dataset, the framework consistently achieves near-zero selection bias for ATT estimation, high stability in variable selection, and competitive computational efficiency, outperforming BACR, BCEE, and other baselines. The practical payoff is scalable, interpretable, and reliable causal inference on high-dimensional real-world observational data, with implications for healthcare policy analysis and beyond.
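To make the three-stage idea concrete, the following is a minimal sketch on simulated data, not the authors' implementation. Several choices are assumptions made purely for illustration: the data-generating process, the direction and form of the sigmoid penalty weights, fitting the outcome model on control units only, implementing per-feature penalties by column rescaling, and every hyperparameter. The helper `weighted_elastic_net` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=(n, 8))
# Assumed roles: cols 0-1 confounders (X_C), 2-3 pure outcome predictors (X_P),
# 4-5 pure treatment predictors (X_T), 6-7 noise (X_S).
t = (X[:, [0, 1, 4, 5]].sum(axis=1) + rng.normal(size=n) > 0).astype(int)
y = 2.0 * t + X[:, [0, 1, 2, 3]].sum(axis=1) + rng.normal(size=n)

Xs = StandardScaler().fit_transform(X)


def weighted_elastic_net(X, y, weights, alpha=0.2, l1_ratio=0.9):
    """Elastic net with per-feature penalty weights (hypothetical helper).

    Dividing column j by weights[j] makes the L1 penalty on the original
    coefficient equal to weights[j] * |beta_j|. The L2 part is rescaled
    differently, so this is only an approximation of a weighted elastic net.
    """
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10000)
    model.fit(X / weights, y)
    return model.coef_ / weights


# Stage 1: SVM exposure model; |coefficient| scores treatment association.
svm = LinearSVC(C=1.0, dual=False, max_iter=10000).fit(Xs, t)
score = np.abs(svm.coef_.ravel())
score = score / score.max()
# Assumed sigmoid smoothing: weights lie in (0.5, 1) and are lower for
# treatment-associated variables, so weak confounders are penalized less.
stage1_w = 1.0 - 0.5 / (1.0 + np.exp(-10.0 * (score - 0.5)))

# Stage 2: outcome model on controls only (assumption), Stage 1 weights.
controls = t == 0
beta2 = weighted_elastic_net(Xs[controls], y[controls], stage1_w)

# Stage 3: adaptive refit; variables dropped in Stage 2 get a huge penalty.
stage3_w = 1.0 / (np.abs(beta2) + 1e-6)
beta3 = weighted_elastic_net(Xs[controls], y[controls], stage3_w)

selected = np.flatnonzero(beta3 != 0)
print("selected columns:", selected.tolist())
```

In this toy setup, the outcome-model stages can only retain variables that predict the outcome, which is what separates $X_C$ and $X_P$ from $X_T$ and $X_S$. The Stage 1 weights merely relax the penalty on treatment-associated variables so that a weakly outcome-predictive confounder is less likely to be dropped.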
Abstract
Feature selection is an important but challenging task in causal inference for obtaining unbiased estimates of causal quantities. Properly selected features not only significantly reduce the time required to run a matching algorithm but, more importantly, can also reduce the bias and variance of the estimated causal quantities. When feature selection techniques are applied in causal inference, the crucial criterion is to select variables that, when used for matching, yield an unbiased and robust estimate of the causal quantity. Recent research suggests that balancing only on treatment-associated variables introduces bias, while balancing on spurious variables increases variance. To address this issue, we propose an enhanced three-stage framework that selects the desired subset of variables significantly better than the existing state-of-the-art feature selection framework for causal inference, resulting in lower bias and variance in estimating the causal quantity. We evaluated our proposed framework on state-of-the-art synthetic data across various settings and observed superior performance within a feasible computation time, indicating scalability to large-scale datasets. Finally, to demonstrate the applicability of our proposed methodology to large-scale real-world data, we evaluated an important US healthcare policy question related to the opioid epidemic: whether opioid use disorder has a causal relationship with suicidal behavior.
