Automatic debiasing of neural networks via moment-constrained learning

Christian L. Hines, Oliver J. Hines

TL;DR

The paper tackles bias in estimating average moment estimands $\Psi=\mathbb{E}[m(\mu,W)]$ by replacing direct learning of the full Riesz representer (RR) $\alpha$ with learning a constrained auxiliary function $\beta_\perp$ such that $\alpha$ is proportional to $\beta-\beta_\perp$. It introduces moment-constrained learning, built on the key identity $\alpha(z)=\frac{h(\beta)}{\|\beta-\beta_\perp\|^2}\{\beta(z)-\beta_\perp(z)\}$, and a practical MADNet architecture that jointly learns $\mu$ and $\beta_\perp$ via a constrained loss. The approach yields debiased estimators that are regular and asymptotically linear (RAL) under mild conditions and shows improved empirical performance over state-of-the-art automatic debiasing methods on semi-synthetic causal inference tasks. The framework offers a robust, automatic alternative to deriving and learning the RR from its functional form, with potential extensions to generalized regression settings and other ML models. Overall, moment-constrained learning improves the stability and accuracy of debiased neural estimators of causal estimands in economics and biostatistics.
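
The identity can be checked numerically on a small discrete example. The sketch below assumes the average treatment effect, where $h(f)=\mathbb{E}[f(1,X)-f(0,X)]$ and the RR is the inverse-probability weight $\alpha(a,x)=a/\pi(x)-(1-a)/\{1-\pi(x)\}$. It also assumes that $\beta_\perp$ is the $L_2$ projection of $\beta$ onto $\{f: h(f)=0\}$ and takes $\beta(a,x)=a$. The distribution, the choice of $\beta$ and all variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical discrete population: X in {0, 1, 2}, binary treatment A.
px = np.array([0.5, 0.3, 0.2])            # P(X = x)
pi = np.array([0.2, 0.5, 0.9])            # propensity P(A = 1 | X = x)
p = np.stack([px * (1 - pi), px * pi])    # cell probabilities p[a, x]

def h(f):
    """Average moment h(f) = E[f(1, X) - f(0, X)] for f stored as f[a, x]."""
    return float(np.sum(px * (f[1] - f[0])))

def inner(f, g):
    """L2 inner product <f, g> = E[f(Z) g(Z)] with Z = (A, X)."""
    return float(np.sum(p * f * g))

# Assumed choice beta(a, x) = a, which gives h(beta) = 1.
beta = np.stack([np.zeros(3), np.ones(3)])

# h is linear, so h(f) = c @ f.ravel(); the constraint set is the null space of c.
c = np.stack([-px, px]).ravel()
_, _, vt = np.linalg.svd(c[None, :])
null_basis = vt[1:].T                     # columns span {f : h(f) = 0}

# Weighted least squares: project beta onto the constraint set in L2(P).
w = np.sqrt(p.ravel())
theta, *_ = np.linalg.lstsq(w[:, None] * null_basis, w * beta.ravel(), rcond=None)
beta_perp = (null_basis @ theta).reshape(2, 3)

# Recover the Riesz representer from the identity.
diff = beta - beta_perp
alpha_hat = h(beta) / inner(diff, diff) * diff

# Closed-form inverse-probability weights for comparison.
alpha_true = np.stack([-1.0 / (1.0 - pi), 1.0 / pi])
```

Here `alpha_hat` is obtained without ever inverting the propensity, which is the practical appeal of working with $\beta_\perp$ rather than the functional form of $\alpha$.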

Abstract

Causal and nonparametric estimands in economics and biostatistics can often be viewed as the mean of a linear functional applied to an unknown outcome regression function. Naively learning the regression function and taking a sample mean of the target functional results in biased estimators, and a rich debiasing literature has developed where one additionally learns the so-called Riesz representer (RR) of the target estimand (targeted learning, double ML, automatic debiasing, etc.). Learning the RR via its derived functional form can be challenging, e.g. due to extreme inverse probability weights or the need to learn conditional density functions. Such challenges have motivated recent advances in automatic debiasing (AD), where the RR is learned directly via minimization of a bespoke loss. We propose moment-constrained learning as a new RR learning approach that addresses some shortcomings in AD, constraining the predicted moments and improving the robustness of RR estimates to optimization hyperparameters. Though our approach is not tied to a particular class of learner, we illustrate it using neural networks, and evaluate on the problems of average treatment/derivative effect estimation using semi-synthetic data. Our numerical experiments show improved performance versus state-of-the-art benchmarks.
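
To make the bias of the naive plug-in concrete, the sketch below compares it with the standard one-step debiased estimator $\mathbb{E}_n[m(\hat{\mu},W)+\alpha(Z)\{Y-\hat{\mu}(Z)\}]$ for the average treatment effect. The simulated data, the deliberately biased regression `mu_hat` and the use of the true inverse-probability weights as the RR are all assumptions made for illustration; none of them come from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process with a true average treatment effect of 2.
x = rng.normal(size=n)
pi = 1.0 / (1.0 + np.exp(-x))             # propensity score P(A = 1 | X)
a = rng.binomial(1, pi)
y = 2.0 * a + x + rng.normal(size=n)

def mu_hat(a_, x_):
    """Deliberately biased outcome regression (treatment coefficient 1.5, not 2)."""
    return 1.5 * a_ + x_

# Naive plug-in: sample mean of m(mu_hat, W) = mu_hat(1, X) - mu_hat(0, X).
plug_in = np.mean(mu_hat(1, x) - mu_hat(0, x))

# Riesz representer of the ATE (inverse-probability weights); known here by construction.
alpha = a / pi - (1 - a) / (1 - pi)

# One-step correction: add the RR-weighted mean residual.
debiased = plug_in + np.mean(alpha * (y - mu_hat(a, x)))
```

The plug-in inherits the error in `mu_hat`, while the correction term removes it to first order. The large weights in `alpha` when `pi` is near 0 or 1 are the kind of instability that motivates learning the RR by other means.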

Paper Structure (21 sections, 1 theorem, 28 equations, 4 figures, 3 tables)

Key Result

Theorem 1

Let $\hat{\beta}_\perp$ be an estimator for $\beta_\perp$, and let $\hat{\mu}^*$ be an estimator for $\mu$ that is targeted such that $\mathbb{E}_n[\{\beta(Z) - \hat{\beta}_\perp(Z)\}\{Y - \hat{\mu}^*(Z)\}] = 0$. Assume that each of the following terms is $o_p(1)$: $\sqrt{n}\langle \hat{\mu}^* - \ldots$
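
One simple way to satisfy the targeting condition $\mathbb{E}_n[\{\beta(Z) - \hat{\beta}_\perp(Z)\}\{Y - \hat{\mu}^*(Z)\}] = 0$ is a one-parameter least-squares fluctuation of an initial fit $\hat{\mu}$ along the direction $\beta - \hat{\beta}_\perp$. The sketch below illustrates that generic construction; it is an assumption that this matches the paper's targeting step, and the data, the initial fit and the estimate of $\beta_\perp$ are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical sample with Z = (A, X) and outcome Y.
x = rng.normal(size=n)
a = rng.binomial(1, 0.5, size=n)
y = 2.0 * a + x + rng.normal(size=n)

# Placeholder initial estimates (in practice these would come from fitted models).
mu_hat = 1.5 * a + 0.8 * x                # initial outcome regression at observed Z
beta = a.astype(float)                    # assumed beta(z) = a
beta_perp_hat = 0.5 + 0.1 * x             # hypothetical estimate of beta_perp

# Fluctuate mu_hat along d = beta - beta_perp_hat with a least-squares step size.
d = beta - beta_perp_hat
eps = np.mean(d * (y - mu_hat)) / np.mean(d ** 2)
mu_star = mu_hat + eps * d

# Empirical targeting condition: zero up to floating-point error.
score = np.mean(d * (y - mu_star))
```

Because `eps` is the least-squares coefficient of the residual on `d`, the updated residual is empirically orthogonal to `d` by construction.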

Figures (4)

  • Figure 1: Illustration of moment-constrained functions. The plane represents the space of zero average moment functions, i.e. $f$ such that $h(f) = \langle f, \alpha \rangle = 0$. The non-zero function $\beta - \beta_\perp$ is orthogonal to the plane, and thus is a scalar multiple of $\alpha$.
  • Figure 2: Multi-headed MLP architecture with three outputs. Typically the intermediate representation has the same width as the internal layers of the shared MLP, and the non-shared MLPs have internal layers with half the width of the shared MLP. During training, a single loss is used based on all scalar outputs, and MLP weights are learned using back-propagation over the entire multi-headed MLP.
  • Figure 3: Mean and standard error of $\mathbb{E}_n[A\hat{\alpha}(Z)] - 1$ (top row) and of $\hat{\Psi}^{(\text{IPW})} - h_n(\mu)$ (bottom row) using 20 datasets of IHDP data where predictions are made on a 20% validation set and the outcome is scaled by its standard deviation.
  • Figure 4: Top row: Low damping coefficients in the basic differential multiplier method (BDMM) (Platt and Barr, 1987) lead to oscillatory behavior around the saddle point solution when the optimisation problem is formulated as an equality constrained Lagrangian. Bottom row: Using the inequality constrained Lagrangian approach described in the main paper results in more stable training and constraint satisfaction. A single dataset from the IHDP data is used to showcase this behavior over 200 epochs.
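
For readers unfamiliar with the BDMM mentioned in the Figure 4 caption, the sketch below applies it to a toy equality-constrained problem: gradient descent on the variables, gradient ascent on the multiplier, plus a quadratic damping term on the constraint. The objective, constraint, step size and starting point are hypothetical choices for illustration; this is not the paper's training procedure and it does not implement the inequality constrained variant.

```python
import numpy as np

def f_grad(x):
    """Gradient of the toy objective f(x) = x1^2 + x2^2."""
    return 2.0 * x

def g(x):
    """Toy equality constraint g(x) = x1 + x2 - 1, to be driven to zero."""
    return x[0] + x[1] - 1.0

G_GRAD = np.array([1.0, 1.0])             # gradient of g (constant for this toy problem)

def bdmm(damping, steps=2000, lr=0.05):
    """Discretised BDMM: descend in x, ascend in the multiplier lam."""
    x = np.array([2.0, 1.0])
    lam = 0.0
    trace = []
    for _ in range(steps):
        gx = g(x)
        trace.append(gx)
        # Descent on f + lam * g + (damping / 2) * g^2 with respect to x.
        x_new = x - lr * (f_grad(x) + (lam + damping * gx) * G_GRAD)
        # Ascent on the multiplier, using the constraint value at the old x.
        lam = lam + lr * gx
        x = x_new
    return x, lam, np.array(trace)

x_opt, lam_opt, trace = bdmm(damping=1.0)
```

The constrained minimiser of this toy problem is $x=(0.5, 0.5)$ with multiplier $\lambda=-1$. Comparing `trace` for `damping=0.0` and `damping=1.0` gives a feel for the role of the damping coefficient on this problem.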
