Optimality and Adaptivity of Deep Neural Features for Instrumental Variable Regression

Juno Kim, Dimitri Meunier, Arthur Gretton, Taiji Suzuki, Zhu Li

TL;DR

This work analyzes nonparametric instrumental variable (IV) regression and proves a minimax-rate guarantee for the Deep Feature Instrumental Variable Regression (DFIV) method: data-adaptive neural features attain the optimal convergence rate when the structural function lies in a Besov class $B_{p,q}^s(\mathcal{X})$. DFIV is a two-stage framework that learns the feature maps $\psi_{\theta_x}$ and $\phi_{\theta_z}$ in tandem, here over a smooth DNN class that guarantees Besov-norm control. The paper proves upper and lower bounds for both the projected and the full (non-projected) risk under link and smoothness conditions. One key result is a separation between DFIV and fixed-feature IV when $p<2$, which establishes adaptivity to spatial inhomogeneity. Another is that Stage 1 needs only $m=\Omega(n)$ samples to attain the minimax rate, whereas kernel methods require $m/n\to\infty$. The analysis combines a dynamic, data-dependent covering argument with Besov-space approximation theory, giving a principled basis for neural features in causal IV regression and guidance on how to split samples between the two stages.

Abstract

We provide a convergence analysis of deep feature instrumental variable (DFIV) regression (Xu et al., 2021), a nonparametric approach to IV regression using data-adaptive features learned by deep neural networks in two stages. We prove that the DFIV algorithm achieves the minimax optimal learning rate when the target structural function lies in a Besov space. This is shown under standard nonparametric IV assumptions, and an additional smoothness assumption on the regularity of the conditional distribution of the covariate given the instrument, which controls the difficulty of Stage 1. We further demonstrate that DFIV, as a data-adaptive algorithm, is superior to fixed-feature (kernel or sieve) IV methods in two ways. First, when the target function possesses low spatial homogeneity (i.e., it has both smooth and spiky/discontinuous regions), DFIV still achieves the optimal rate, while fixed-feature methods are shown to be strictly suboptimal. Second, comparing with kernel-based two-stage regression estimators, DFIV is provably more data efficient in the Stage 1 samples.
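
To make the two-stage structure concrete, here is a minimal sketch of two-stage ridge regression for IV. It is not the DFIV algorithm: the feature maps are fixed affine maps rather than learned networks, and the names `ridge`, `two_stage_iv`, `affine`, and `sample`, the regularization values, and the synthetic data model are all assumptions made for illustration. It only shows how Stage 1 (with $m$ samples) estimates the conditional mean of the covariate features given the instrument, and how Stage 2 (with $n$ samples) regresses the outcome on those predicted features.

```python
import numpy as np

def ridge(A, B, lam):
    """Closed-form minimizer of (1/k)||B - A W||^2 + lam ||W||^2."""
    k, d = A.shape
    return np.linalg.solve(A.T @ A / k + lam * np.eye(d), A.T @ B / k)

def two_stage_iv(x1, z1, z2, y2, psi, phi, lam1=1e-3, lam2=1e-3):
    # Stage 1 (m samples of (x, z)): regress psi(x) on phi(z),
    # so that phi(z) @ V estimates E[psi(X) | Z = z].
    V = ridge(phi(z1), psi(x1), lam1)
    # Stage 2 (n samples of (z, y)): regress y on the predicted features.
    u = ridge(phi(z2) @ V, y2, lam2)
    return u  # structural estimate: f_hat(x) = psi(x) @ u

def affine(t):
    # Fixed feature map [1, t]; DFIV would learn this map with a network.
    return np.column_stack([np.ones_like(t), t])

rng = np.random.default_rng(0)

def sample(k):
    # Hypothetical data model: e confounds x and y, z is a valid instrument,
    # and the true structural function is f(x) = x.
    z = rng.normal(size=k)
    e = rng.normal(size=k)
    x = z + e
    y = x + 2.0 * e + 0.1 * rng.normal(size=k)
    return x, z, y

m, n = 5000, 5000                 # Stage 1 and Stage 2 sample sizes
x1, z1, _ = sample(m)
x2, z2, y2 = sample(n)

u_iv = two_stage_iv(x1, z1, z2, y2, psi=affine, phi=affine)
u_ols = ridge(affine(x2), y2, 1e-3)   # naive regression of y on x, for contrast
iv_slope, ols_slope = u_iv[1], u_ols[1]
print("IV slope:", iv_slope, "OLS slope:", ols_slope)
```

In this synthetic model the naive regression is biased by the confounder, while the two-stage estimate should land near the true slope of 1. The sketch uses $m = n$ only for simplicity; it does not reproduce the rates or the sample-size requirements analyzed in the paper.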

Paper Structure (48 sections, 22 theorems, 193 equations, 1 figure, 2 tables)


Key Result

Theorem 3.1

Under Assumptions ass:noiseass:noise1,ass:str,ass:link and Assumption ass:smooth with domain restriction eqn:restricted or regularization eqn:soboreg with $\lambda$ asymptotic to the rate below, or Assumption ass:alternative without regularization, by choosing $\mathcal{F}$ ...

Figures (1)

  • Figure 1: Causal graph of IV.

Theorems & Definitions (38)

  • Definition 2.1
  • Definition 2.2: B-spline basis
  • Remark 2.3
  • Theorem 3.1: projected upper bound for DFIV
  • Lemma 3.2
  • Proposition 3.3: projected minimax lower bound
  • Corollary 3.4: projected optimality of DFIV
  • Theorem 3.5: full upper bound for DFIV
  • Remark 3.6
  • Proposition 3.7: full minimax lower bound
  • ...and 28 more