
Fast Uncertainty Quantification for Kernel-Based Estimators in Large-Scale Causal Inference

Matthew Kosko, Falco J. Bargagli-Stoffi, Lin Wang, Michele Santacatterina

Abstract

Kernel methods are widely used in causal inference for tasks such as treatment effect estimation, policy evaluation, and policy learning. The bootstrap is a standard tool for uncertainty quantification because of its broad applicability. As increasingly large datasets become available, such as the 2023 U.S. Natality data from the National Vital Statistics System (NVSS), which includes 3,596,017 registered births, the computational demands of these methods increase substantially. Kernel methods are known to scale poorly with sample size, and this limitation is further exacerbated by the repeated re-fitting required by the bootstrap. As a result, bootstrap-based inference for kernel-based estimators can become computationally infeasible in large-scale settings. In this paper, we address these challenges by extending the causal Bag of Little Bootstraps (cBLB) algorithm to kernel methods. Our approach achieves computational scalability by combining subsampling and resampling while preserving first-order uncertainty quantification and asymptotically correct coverage. We evaluate the method across three representative implementations: kernelized augmented outcome-weighted learning, kernel-based minimax weighting, and double machine learning with kernel support vector machines. We show in simulations that our method yields confidence intervals with nominal coverage at a fraction of the computational cost. We further demonstrate its utility in a real-world application by estimating the effect of any amount of smoking on birth weight, as well as the optimal treatment regime, using the NVSS dataset, where the standard bootstrap is computationally infeasible.
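
The combination of subsampling and resampling described above can be illustrated with a minimal Python sketch. It is a generic bag-of-little-bootstraps routine for an estimator that is a sample mean of per-unit contributions, not the authors' implementation: the function name `blb_interval`, the subset-size rule $b = \lceil n^{\gamma} \rceil$, the default values, and the averaging of percentile endpoints across subsets are all assumptions made for illustration. In a kernel setting, the expensive fit would be run on each subset of size $b$ rather than on all $n$ units.

```python
import numpy as np


def blb_interval(contrib, n_subsets=5, gamma=0.7, n_reps=100, alpha=0.05, seed=0):
    """Percentile interval for the mean of `contrib` via a bag of little bootstraps.

    `contrib` stands in for per-unit contributions to an estimator (hypothetical
    input; in practice these would come from fitted nuisance models).
    """
    contrib = np.asarray(contrib, dtype=float)
    rng = np.random.default_rng(seed)
    n = contrib.shape[0]
    b = int(np.ceil(n ** gamma))  # assumed subset-size rule, b = ceil(n^gamma)

    lowers, uppers = [], []
    for _ in range(n_subsets):
        # Subsampling step: draw b distinct units without replacement.
        idx = rng.choice(n, size=b, replace=False)
        sub = contrib[idx]

        # Resampling step: each replicate assigns multinomial counts that sum
        # to n over the b units, so it mimics a size-n bootstrap sample.
        counts = rng.multinomial(n, np.full(b, 1.0 / b), size=n_reps)  # (n_reps, b)
        reps = counts @ sub / n  # weighted mean for every replicate

        lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
        lowers.append(lo)
        uppers.append(hi)

    # Aggregate the per-subset interval endpoints by averaging.
    return float(np.mean(lowers)), float(np.mean(uppers))
```

Each replicate touches only $b$ distinct units yet carries total weight $n$, which is where the savings over a bootstrap on all $n$ units would come from.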

Paper Structure

This paper contains 25 sections, 1 theorem, 22 equations, 7 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Fix a subset $I_k=\{i_1,\ldots,i_b\}$ of size $b$ and fitted objects computed on $\{Z_i:i\in I_k\}$, yielding contributions $\{\hat{\theta}_{i,k}: i\in I_k\}$ and the subset estimator. Let $M=(M_1,\ldots,M_b)\sim\mathrm{Multinomial}(n;1/b,\ldots,1/b)$ and define the cBLB replicate. Assume: (i) the corresponding full-sample estimator admits an influence-function expansion with influence function …
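
The statement above names a subset estimator and a cBLB replicate without displaying them. Under one natural reading (an assumption, not the paper's verified notation; the symbols $\hat{\theta}_k$ and $\hat{\theta}_k^{*}$ are introduced here for illustration), the subset estimator averages the $b$ contributions and the replicate reweights them by the multinomial counts:

```latex
\hat{\theta}_k = \frac{1}{b}\sum_{i\in I_k}\hat{\theta}_{i,k},
\qquad
\hat{\theta}_k^{*} = \frac{1}{n}\sum_{j=1}^{b} M_j\,\hat{\theta}_{i_j,k}.

% Moments conditional on the subset, using
% E[M_j] = n/b,  Var(M_j) = (n/b)(1 - 1/b),  Cov(M_j, M_l) = -n/b^2  (j \neq l):
\mathbb{E}\bigl[\hat{\theta}_k^{*} \mid I_k\bigr] = \hat{\theta}_k,
\qquad
\mathrm{Var}\bigl(\hat{\theta}_k^{*} \mid I_k\bigr)
  = \frac{1}{n}\cdot\frac{1}{b}\sum_{i\in I_k}\bigl(\hat{\theta}_{i,k}-\hat{\theta}_k\bigr)^{2}.
```

Under this reading, the replicate is centered at the subset estimator and its conditional variance scales as $1/n$ rather than $1/b$, so it would reflect full-sample variability even though only $b$ distinct units are used and nothing is refit.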

Figures (7)

  • Figure 1: Confidence intervals for the optimal value from 1000 replications of the cBLB algorithm (Kernelized AOL)
  • Figure 2: Confidence intervals for the ATE from 1000 replications of the cBLB algorithm (Kernel Minimax Weights)
  • Figure 3: Confidence intervals for the ATE from 1000 replications of the cBLB algorithm (DML using SVM and cross-fitting)
  • Figure 4: Timing results from 25 replications of the cBLB algorithm ($n = 5000$) for Kernelized AOL
  • Figure 5: Timing results from 25 replications of the cBLB algorithm ($n = 5000$) for Kernel Minimax Weights
  • ...and 2 more figures

Theorems & Definitions (1)

  • Theorem 1: First-order validity of cBLB (no refit)