Automatic debiasing of neural networks via moment-constrained learning
Christian L. Hines, Oliver J. Hines
TL;DR
The paper tackles bias in estimating average moment estimands $\Psi=\mathbb{E}[m(\mu,W)]$ by replacing direct learning of the Riesz representer (RR) $\alpha$ with learning a moment-constrained auxiliary function $\beta_\perp$, chosen so that $\alpha$ is proportional to $\beta-\beta_\perp$. It introduces moment-constrained learning, with the key identity $\alpha(z)= \frac{h(\beta)}{\|\beta-\beta_\perp\|^2}\{\beta(z)-\beta_\perp(z)\}$, and a practical MADNet architecture that jointly learns $\mu$ and $\beta_\perp$ via a constrained loss. The approach yields debiased estimators that are regular and asymptotically linear (RAL) under mild conditions, and it shows improved empirical performance over state-of-the-art automatic debiasing methods on semi-synthetic causal inference tasks. The framework offers a robust, automatic alternative to deriving and learning the RR from its functional form, with potential extensions to generalized regression settings and other ML models. Overall, moment-constrained learning improves the stability and accuracy of debiased neural estimators for causal estimands in economics and biostatistics.
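The identity in the TL;DR can be checked numerically in a toy finite-dimensional setting. The sketch below is illustrative only and is not the paper's MADNet procedure. It assumes that functions live on six support points with probabilities `p`, that the inner product is $\langle f,g\rangle=\sum_i p_i f_i g_i$, that $h(f)=\sum_i c_i f_i$ is a linear functional whose RR is therefore $c/p$, and that $\beta_\perp$ is the least-squares projection of $\beta$ onto the constrained set $\{f : h(f)=0\}$. The names `p`, `c`, `beta`, `inner`, and `h` are hypothetical.

```python
import numpy as np

# Toy discrete distribution over six support points (hypothetical values).
p = np.array([0.10, 0.20, 0.30, 0.15, 0.15, 0.10])
# Coefficients of an assumed linear functional h(f) = sum_i c_i * f_i.
c = np.array([1.0, -1.0, 0.5, 0.0, 2.0, -0.5])
# Any function beta with h(beta) != 0.
beta = np.array([1.0, 0.5, -1.0, 2.0, 0.0, 1.5])


def inner(f, g):
    """Inner product <f, g> = E[f g] under the toy distribution."""
    return float(np.sum(p * f * g))


def h(f):
    """Linear functional of interest."""
    return float(c @ f)


# Basis of the constrained set {f : h(f) = 0}: the null space of c.
_, _, vt = np.linalg.svd(c[None, :])
null_basis = vt[1:].T  # shape (6, 5)

# beta_perp: weighted least-squares projection of beta onto the null space.
gram = null_basis.T @ (p[:, None] * null_basis)
w = np.linalg.solve(gram, null_basis.T @ (p * beta))
beta_perp = null_basis @ w

# Recover the Riesz representer through the identity from the TL;DR.
diff = beta - beta_perp
alpha_rec = h(beta) / inner(diff, diff) * diff

# In this toy space the RR is known in closed form: alpha = c / p.
alpha_true = c / p
print("max abs error:", np.max(np.abs(alpha_rec - alpha_true)))
```

Note that `beta_perp` is computed from the constraint alone, without using the closed-form representer `c / p`, which serves only as a reference for comparison.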
Abstract
Causal and nonparametric estimands in economics and biostatistics can often be viewed as the mean of a linear functional applied to an unknown outcome regression function. Naively learning the regression function and taking a sample mean of the target functional results in biased estimators, and a rich debiasing literature has developed where one additionally learns the so-called Riesz representer (RR) of the target estimand (targeted learning, double ML, automatic debiasing, etc.). Learning the RR via its derived functional form can be challenging, e.g. due to extreme inverse probability weights or the need to learn conditional density functions. Such challenges have motivated recent advances in automatic debiasing (AD), where the RR is learned directly via minimization of a bespoke loss. We propose moment-constrained learning as a new RR learning approach that addresses some shortcomings in AD, constraining the predicted moments and improving the robustness of RR estimates to optimization hyperparameters. Though our approach is not tied to a particular class of learner, we illustrate it using neural networks, and evaluate it on the problems of average treatment/derivative effect estimation using semi-synthetic data. Our numerical experiments show improved performance versus state-of-the-art benchmarks.
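For concreteness, the naive and debiased estimators contrasted in the abstract are commonly written as follows. This is the standard one-step form from the debiasing literature, not a formula quoted from this paper. The notation is assumed here: $Y$ is the outcome, $Z$ the regressors, $W$ the observed data, and hats denote learned estimates. The last two lines give the familiar average treatment effect case with binary treatment $A$, covariates $X$, and propensity score $\pi$, where the RR involves inverse probability weights.

```latex
\begin{align*}
\hat\Psi_{\text{naive}}
  &= \frac{1}{n}\sum_{i=1}^{n} m(\hat\mu, W_i), \\
\hat\Psi_{\text{debiased}}
  &= \frac{1}{n}\sum_{i=1}^{n}
     \Big[ m(\hat\mu, W_i) + \hat\alpha(Z_i)\{Y_i - \hat\mu(Z_i)\} \Big], \\
\text{ATE:}\quad m(\mu, W) &= \mu(1, X) - \mu(0, X), \\
\alpha(A, X) &= \frac{A}{\pi(X)} - \frac{1 - A}{1 - \pi(X)},
  \qquad \pi(X) = \Pr(A = 1 \mid X).
\end{align*}
```

When $\pi(X)$ is close to 0 or 1 the weights in $\alpha$ become very large, which is one example of the instability that motivates learning the RR directly.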
