Optimizing Feature Selection in Causal Inference: A Three-Stage Computational Framework for Unbiased Estimation

Tianyu Yang, Md. Noor-E-Alam

TL;DR

The paper tackles unbiased causal effect estimation from observational data by rethinking feature selection, focusing on retaining confounders $X_C$ and pure outcome predictors $X_P$ while excluding pure treatment predictors $X_T$ and noise $X_S$ in the design stage. It proposes a robust three-stage framework that uses an SVM-based exposure model (Stage 1) followed by two adaptive elastic-net outcome-model steps (Stages 2 and 3), with penalty smoothing (sigmoid or tanh) to guide variable selection; this design improves oracle properties and reduces tuning sensitivity compared to traditional two-stage methods. Across extensive synthetic benchmarks and a large NSDUH dataset, the framework consistently achieves near-zero selection bias for ATT estimation, high stability in variable selection, and competitive computational efficiency, outperforming BACR, BCEE, and other baselines. The work demonstrates practical impact by enabling scalable, interpretable, and reliable causal inference in real-world high-dimensional observational data, with strong implications for healthcare policy analysis and beyond.

Abstract

Feature selection is an important but challenging task in causal inference for obtaining unbiased estimates of causal quantities. Properly selected features in causal inference not only significantly reduce the time required to implement a matching algorithm but, more importantly, can also reduce the bias and variance when estimating causal quantities. When feature selection techniques are applied in causal inference, the crucial criterion is to select variables that, when used for matching, can achieve an unbiased and robust estimation of causal quantities. Recent research suggests that balancing only on treatment-associated variables introduces bias while balancing on spurious variables increases variance. To address this issue, we propose an enhanced three-stage framework that shows a significant improvement in selecting the desired subset of variables compared to the existing state-of-the-art feature selection framework for causal inference, resulting in lower bias and variance in estimating the causal quantity. We evaluated our proposed framework using state-of-the-art synthetic data across various settings and observed superior performance within a feasible computation time, ensuring scalability for large-scale datasets. Finally, to demonstrate the applicability of our proposed methodology to large-scale real-world data, we evaluated an important US healthcare policy question related to the opioid epidemic: whether opioid use disorder has a causal relationship with suicidal behavior.

Paper Structure

This paper contains 28 sections, 4 theorems, 12 equations, 5 figures, 1 table.

Key Result

Lemma 1

Assume the outcome model is a linear combination of the treatment indicator and the subset of selected variables, i.e., $\bm{Y}=\bm{T}\alpha+\bm{X\beta}+\bm{\epsilon}$. To achieve unbiased estimation of the causal quantity, the model should select $\bm{X_C}$ and $\bm{X_P}$, and exclude $\bm{X_T}$.
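The intuition behind Lemma 1 can be illustrated with a small Monte Carlo sketch. It is not the paper's three-stage framework: it only fits the linear outcome model from the lemma by ordinary least squares on different covariate subsets. The data-generating process, coefficient values, sample size, and the function names `simulate_alpha_hat` and `summarize` are all assumptions made for illustration.

```python
import numpy as np

TRUE_ALPHA = 2.0  # assumed treatment effect for this toy example


def simulate_alpha_hat(include, n=2000, seed=0):
    """Estimate alpha in Y = T*alpha + X*beta + eps by OLS,
    using only the covariates named in `include` ("C", "P", "T")."""
    rng = np.random.default_rng(seed)
    x_c = rng.normal(size=n)  # confounder: affects both T and Y
    x_p = rng.normal(size=n)  # pure outcome predictor: affects Y only
    x_t = rng.normal(size=n)  # pure treatment predictor: affects T only

    # Hypothetical treatment assignment driven by X_C and X_T.
    prob = 1.0 / (1.0 + np.exp(-(x_c + x_t)))
    t = (rng.uniform(size=n) < prob).astype(float)

    # Hypothetical linear outcome driven by T, X_C and X_P.
    y = TRUE_ALPHA * t + 1.5 * x_c + 1.5 * x_p + rng.normal(size=n)

    covariates = {"C": x_c, "P": x_p, "T": x_t}
    design = np.column_stack([np.ones(n), t] + [covariates[k] for k in include])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]  # coefficient on the treatment indicator


def summarize(include, reps=200):
    """Return (bias, standard deviation) of alpha-hat over `reps` simulated datasets."""
    est = np.array([simulate_alpha_hat(include, seed=s) for s in range(reps)])
    return est.mean() - TRUE_ALPHA, est.std()


if __name__ == "__main__":
    for subset in (["C", "P"], ["C"], ["P", "T"], ["C", "P", "T"]):
        bias, sd = summarize(subset)
        print(f"covariates={subset}: bias={bias:+.3f}, sd={sd:.3f}")
```

Under these assumptions, the subset containing $\bm{X_C}$ and $\bm{X_P}$ should give an estimate of $\alpha$ with bias close to zero, dropping $\bm{X_C}$ should produce a clearly biased estimate, and dropping $\bm{X_P}$ should leave the estimate roughly unbiased but noisier. The effect of adding $\bm{X_T}$ can be inspected from the printed output.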

Figures (5)

  • Figure 1: Selection bias in the ATT estimation for each model.
  • Figure 2: Variable selection across all the scenarios. The x-axis refers to the index of the covariates, while the y-axis indicates the probability of each variable being selected.
  • Figure 3: Bias of ATT estimation for causal machine learning models for each scenario.
  • Figure 4: Bias of each model relative to the ATT estimated from domain knowledge. The ideal model/framework should have 0 bias.
  • Figure 5: Probability of each variable being selected from the opioid use data over 500 runs. The x-axis refers to the index of the covariates, while the y-axis indicates the probability of each variable being selected.

Theorems & Definitions (4)

  • Lemma 1: Selection Bias and Variance of Estimation of the Causal Quantity
  • Proposition 1: Ideal Subset of Variables to Include
  • Theorem 1: Oracle Property of Proposed Model
  • Proposition 2: Selection Ability of Proposed Model