The Challenges of Hyperparameter Tuning for Accurate Causal Effect Estimation

Damian Machlanski, Spyridon Samothrakis, Paul Clarke

TL;DR

The paper investigates how hyperparameter tuning affects causal effect estimation from observational data, highlighting the mismatch between the observable proxy metrics available for model selection and the ideal metric, which is inaccessible because counterfactuals are never observed. It formalizes observable vs. potential losses, surveys conditional average treatment effect (CATE) estimators, and analyzes model selection within a nested optimization framework, using the IHDP, Jobs, Twins, and News benchmarks to quantify the probability of reaching state-of-the-art (SotA) performance across hyperparameters and metrics. Key findings show that appropriate hyperparameters substantially raise the probability of achieving SotA performance across estimators and base learners, while the choice of observable metric can cause large variability and even degrade results; in some cases, matching-based and R-Loss-style metrics come close to Oracle performance. The study argues for a stronger focus on robust model selection metrics and standardized tuning practices, suggesting that hyperparameter quality, not estimator choice, often drives performance, with practical implications for benchmarking and causal inference workflows.

Abstract

ML is playing an increasingly crucial role in estimating causal effects of treatments on outcomes from observational data. Many ML methods ('causal estimators') have been proposed for this task. All of these methods, as with any ML approach, require extensive hyperparameter tuning. For non-causal predictive tasks, there is a consensus on the choice of tuning metrics (e.g. mean squared error), making it simple to compare models. However, for causal inference tasks, such a consensus is yet to be reached, making any comparison of causal models difficult. On top of that, there is no ideal metric on which to tune causal estimators, so one must rely on proxies. Furthermore, the fact that model selection in causal inference involves multiple components (causal estimator, ML regressor, hyperparameters, metric) complicates the issue even further. In order to evaluate the importance of each component, we perform an extensive empirical study on their combination. Our experimental setup involves many commonly used causal estimators, regressors ('base learners' henceforth) and metrics applied to four well-known causal inference benchmark datasets. Our results show that hyperparameter tuning increased the probability of reaching state-of-the-art performance in average ($65\% \rightarrow 81\%$) and individualised ($50\% \rightarrow 57\%$) effect estimation with only commonly used estimators. We also show that the performance of standard metrics can be inconsistent across different scenarios. Our findings highlight the need for further research to establish whether metrics uniformly capable of state-of-the-art performance in causal model evaluation can be found.

Paper Structure (31 sections, 28 equations, 11 figures, 1 table)

Figures (11)

  • Figure 1: MSE - evaluation metric on observed data (validation set/cross-validation) used for model selection purposes (accessible with real datasets; lower is better). PEHE - evaluation metric on unobserved test data (not accessible with real datasets due to missing counterfactuals; lower is better).
  • Figure 2: Probability of reaching SotA performance levels (higher is better) depending on the quality of hyperparameters (colours). The results are aggregated across all CATE estimators and datasets. Error bars: 95% confidence intervals. X-axis: $\epsilon_{ATE}$ and $\epsilon_{ATT}$. Y-axis: $PEHE$ and $\mathcal{R}_{pol}$. Interpretation: default hyperparameters are not optimal; the Oracle performances, achieved via the best hyperparameter values and selected with potential metrics, have a significantly higher probability of reaching SotA.
  • Figure 3: Probability of reaching SotA performance levels (higher is better) by individual CATE estimators (left) and base learners (right) depending on the quality of hyperparameters (colours). The results are aggregated across all datasets and potential metrics. Error bars: 95% confidence intervals. Interpretation: regardless of estimator and learner, the best hyperparameters provide mild or significant improvements in the probability of reaching SotA compared to default hyperparameters (with some exceptions).
  • Figure 4: Probability of reaching SotA performance levels (higher is better) depending on the model selection metric used (colours). The results are aggregated across all CATE estimators and datasets. Error bars: 95% confidence intervals. X-axis: $\epsilon_{ATE}$ and $\epsilon_{ATT}$. Y-axis: $PEHE$ and $\mathcal{R}_{pol}$. Only the best performing observable metrics are included for readability. The metrics shown in different colours were used to select candidates on the validation data; the selected candidates were then evaluated on the test data using potential metrics (the x- and y-axes). Interpretation: the choice of metric can seriously impact SotA probability and, by extension, the final estimation performance.
  • Figure 5: Probability of reaching SotA performance levels (higher is better) depending on the model selection metric used (y-axis). The results are aggregated across all CATE estimators, datasets and metrics. Error bars: 95% confidence intervals. The metrics shown along the y-axis were used to select candidates on the validation data; the selected candidates were then evaluated on the test data using potential metrics.
  • ...and 6 more figures
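
The gap between observable and potential metrics described in the captions above can be illustrated with a small simulation. The sketch below is not the paper's experimental setup: the synthetic data-generating process, the T-learner built on polynomial ridge regression, and the hyperparameter grid (`degree`, `alpha`) are all assumptions chosen for illustration. It selects a candidate by factual validation MSE (observable) and compares it against the candidate an Oracle would pick using test PEHE (computable here only because the data are simulated). PEHE is written with a square root, which is one common convention.

```python
import numpy as np

rng = np.random.default_rng(0)


def simulate(n):
    """Hypothetical data-generating process with a known treatment effect."""
    x = rng.uniform(-2.0, 2.0, size=n)
    t = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))  # treatment depends on x (confounding)
    tau = 1.0 + 0.5 * x                            # true CATE, unknown in real data
    y = np.sin(x) + t * tau + rng.normal(0.0, 0.3, size=n)  # factual outcome only
    return x, t, y, tau


def features(x, degree):
    # Polynomial features [1, x, x^2, ..., x^degree]
    return np.vander(x, degree + 1, increasing=True)


def fit_ridge(x, y, degree, alpha):
    # Closed-form ridge regression (the intercept is penalised too, for brevity)
    phi = features(x, degree)
    gram = phi.T @ phi + alpha * np.eye(degree + 1)
    return np.linalg.solve(gram, phi.T @ y)


def predict(w, x):
    return features(x, len(w) - 1) @ w


def fit_t_learner(x, t, y, degree, alpha):
    # T-learner: one outcome model per treatment arm
    w0 = fit_ridge(x[t == 0], y[t == 0], degree, alpha)
    w1 = fit_ridge(x[t == 1], y[t == 1], degree, alpha)
    return w0, w1


def factual_mse(model, x, t, y):
    # Observable metric: error on the outcomes that were actually observed
    w0, w1 = model
    y_hat = np.where(t == 1, predict(w1, x), predict(w0, x))
    return float(np.mean((y - y_hat) ** 2))


def pehe(model, x, tau):
    # Potential metric: needs the true effect, so only available on synthetic data
    w0, w1 = model
    tau_hat = predict(w1, x) - predict(w0, x)
    return float(np.sqrt(np.mean((tau - tau_hat) ** 2)))


def ate_error(model, x, tau):
    # Absolute error in the average treatment effect
    w0, w1 = model
    tau_hat = predict(w1, x) - predict(w0, x)
    return float(abs(np.mean(tau) - np.mean(tau_hat)))


x_tr, t_tr, y_tr, _ = simulate(300)
x_va, t_va, y_va, _ = simulate(300)
x_te, _, _, tau_te = simulate(1000)

# Hypothetical hyperparameter grid: polynomial degree and ridge penalty
grid = [(d, a) for d in (1, 3, 5, 9) for a in (1e-3, 1.0, 100.0)]

results = []
for degree, alpha in grid:
    model = fit_t_learner(x_tr, t_tr, y_tr, degree, alpha)
    results.append({
        "degree": degree,
        "alpha": alpha,
        "val_mse": factual_mse(model, x_va, t_va, y_va),
        "test_pehe": pehe(model, x_te, tau_te),
        "test_ate_err": ate_error(model, x_te, tau_te),
    })

by_mse = min(results, key=lambda r: r["val_mse"])    # what a practitioner can do
oracle = min(results, key=lambda r: r["test_pehe"])  # what only an Oracle can do

print("selected by validation MSE:", by_mse)
print("selected by Oracle PEHE:   ", oracle)
```

By construction, the Oracle pick can never have a worse test PEHE than the MSE-selected candidate; how large the gap is depends on the data and the grid, which is the kind of variability the figures above summarise.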