Choosing Covariate Balancing Methods for Causal Inference: Practical Insights from a Simulation Study

Etienne Peyrot, Raphaël Porcher, Francois Petit

TL;DR

This paper addresses confounding in observational causal inference by benchmarking four covariate-balancing approaches, namely IPTW, energy balancing (EB), kernel optimal matching (KOM), and tailored-loss covariate balancing propensity scores (TLF), for the two standard estimands, $\text{ATE}$ and $\text{ATT}$. It implements each method according to its authors' published recommendations and evaluates performance through Monte Carlo simulations across 36 scenarios, using both weighted least squares (WLS) and a doubly robust (DR) estimator. The key findings are that DR estimation reduces sensitivity to the chosen weighting scheme when the outcome model adjusts for all confounders; EB and KOM generally offer the most reliable performance, with EB being tuning-free but scale-dependent and KOM requiring kernel and regularization choices; IPTW is variance-sensitive under low or missing overlap, while TLF exhibits low variance but higher bias, leading to RMSE plateaus. Practically, the results suggest triangulating conclusions across EB and KOM for robustness, using DR when the outcome model adjusts for all confounders, and recognizing the challenge of incorporating weight-estimation uncertainty into confidence intervals for newer methods. These insights provide pragmatic guidance for practitioners dealing with overlap limitations and model misspecification in observational data analysis.

Abstract

Background: Inverse probability of treatment weighting (IPTW) is used for confounding adjustment in observational studies. Newer weighting methods include energy balancing (EB), kernel optimal matching (KOM), and tailored-loss covariate balancing propensity scores (TLF), but practical guidance remains limited. We evaluate their performance when implemented according to published recommendations. Methods: We conducted Monte Carlo simulations across 36 scenarios varying sample size, treatment prevalence, and a complexity factor increasing confounding and reducing overlap. Data generation used predominantly categorical covariates with some correlation. Average treatment effect and average treatment effect on the treated were estimated using IPTW, EB, KOM, and TLF combined with weighted least squares and, when supported, a doubly robust (DR) estimator. Inference followed published recommendations for each method when feasible, using standard alternatives otherwise. The PROBITsim dataset was used for illustration. Results: DR reduced sensitivity to the weighting scheme with an outcome regression adjusted for all confounders, despite functional-form misspecification. EB and KOM were most reliable; EB was tuning-free but scale-dependent, whereas KOM required kernel and penalty choices. IPTW was variance-sensitive when treatment prevalence was far from 50%. TLF traded lower variance for higher bias, producing an RMSE plateau and sub-nominal confidence interval coverage. PROBITsim results mirrored these patterns. Conclusions: Rather than identifying a best method, our findings highlight failure modes and tuning choices to monitor. When the outcome regression adjusts for all confounders, DR estimation can be dependable across weighting schemes. Incorporating weight-estimation uncertainty into confidence intervals remains a key challenge for newer approaches.
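
As a rough illustration of the weighting step the abstract refers to, the sketch below computes IPTW weights for the ATE and the ATT and uses them in a normalized weighted difference in means. It is not the paper's implementation. The propensity scores are assumed to be already known (in practice they are estimated, for example by logistic regression), the function names `iptw_weights` and `weighted_mean_difference` are hypothetical, and the weighted difference in means corresponds only to the simplest weighted least squares fit, a regression of the outcome on treatment alone; the paper's WLS and DR specifications may differ.

```python
def iptw_weights(treatment, propensity, estimand="ATE"):
    """Return one weight per unit from 0/1 treatment indicators and propensity scores.

    Assumption: `propensity` holds P(treated | covariates) for each unit and is
    supplied by the caller; this sketch does not estimate it.
    """
    weights = []
    for t, e in zip(treatment, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        if estimand == "ATE":
            # Reweight both groups toward the full population.
            w = 1.0 / e if t == 1 else 1.0 / (1.0 - e)
        elif estimand == "ATT":
            # Treated units keep weight 1; controls are reweighted toward the treated.
            w = 1.0 if t == 1 else e / (1.0 - e)
        else:
            raise ValueError("estimand must be 'ATE' or 'ATT'")
        weights.append(w)
    return weights


def weighted_mean_difference(outcome, treatment, weights):
    """Normalized weighted mean outcome of the treated minus that of the controls."""
    def weighted_mean(group):
        num = sum(w * y for y, t, w in zip(outcome, treatment, weights) if t == group)
        den = sum(w for t, w in zip(treatment, weights) if t == group)
        return num / den
    return weighted_mean(1) - weighted_mean(0)


# Toy data, invented purely for illustration.
treatment = [1, 1, 0, 0]
propensity = [0.8, 0.5, 0.5, 0.2]
outcome = [3.0, 2.0, 1.5, 1.0]

for estimand in ("ATE", "ATT"):
    w = iptw_weights(treatment, propensity, estimand)
    print(estimand, weighted_mean_difference(outcome, treatment, w))
```

Propensity scores close to 0 or 1 make the ATE weights very large, which is one way to see why IPTW becomes variance-sensitive when overlap is poor.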

Paper Structure

This paper contains 40 sections, 14 equations, 3 figures, 1 table.

Figures (3)

  • Figure 1: RMSE of the ATE estimate for each pair of weighting method and estimator, by scenario.
  • Figure 2: RMSE of the ATT estimate for each pair of weighting method and estimator, by scenario.
  • Figure 3: Coverage of the $95\%$ confidence interval of the ATE estimate for each pair of weighting method and estimator, by scenario. The black dashed line is the nominal coverage ($95\%$).
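
The quantities plotted in these figures can be computed from Monte Carlo replicates along the following lines. This is a minimal sketch, assuming each replicate yields a point estimate and a confidence interval and that the true effect is known from the data-generating process; the function names are hypothetical and the paper's own code may differ. Because squared RMSE equals squared bias plus variance, an estimator with low variance but persistent bias stops improving as sample size grows, which matches the RMSE plateau reported for TLF.

```python
import math


def bias(estimates, truth):
    """Mean of the replicate estimates minus the true effect."""
    return sum(estimates) / len(estimates) - truth


def rmse(estimates, truth):
    """Root mean squared error of the replicate estimates around the true effect."""
    return math.sqrt(sum((est - truth) ** 2 for est in estimates) / len(estimates))


def coverage(intervals, truth):
    """Fraction of (lower, upper) confidence intervals that contain the true effect."""
    hits = sum(1 for lower, upper in intervals if lower <= truth <= upper)
    return hits / len(intervals)
```

A coverage value below 0.95 for nominal 95% intervals indicates intervals that are too narrow or centered on a biased estimate.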