Highly Adaptive Empirical Risk Minimization with Principal Components

Carlos García Meixide, Mingxun Wang, Alejandro Schuler, Mark J. van der Laan

Abstract

The Highly Adaptive Lasso (HAL) delivers unprecedented guarantees, such as dimension-free minimax optimal rates, in nonparametric minimum loss estimation under minimal smoothness assumptions. However, the practical use of HAL has been severely limited by its indicator basis expansion, which grows exponentially and becomes computationally prohibitive in moderate to high dimensions. Existing screening strategies drastically reduce this dimension but lack any theoretical justification. We introduce the Principal Component Highly Adaptive (PC-HA) family of estimators, which for the first time provides a principled and theoretically valid dimension reduction. We establish formal results on the score equations solved by these PC-HA estimators, which allow plug-in efficiency and pointwise asymptotic normality results to be transferred from HAL to the PC-HA estimators under comparable complexity control.
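
To make the idea of reducing the HAL basis through principal components concrete, the following is a minimal sketch and not the authors' implementation. It assumes a zero-order HAL indicator basis with knots at the observed points, an uncentered SVD of the design matrix to obtain the principal directions, and a ridge penalty on the PC coefficients (chosen here only because it has a closed form). The function names `hal_indicator_basis`, `pc_ha_ridge_fit` and `pc_ha_predict`, and the parameters `n_components` and `ridge`, are hypothetical.

```python
from itertools import combinations

import numpy as np


def hal_indicator_basis(X, knots):
    """Zero-order HAL design: one column per (coordinate subset s, knot u),
    equal to prod_{j in s} 1{x_j >= u_j}."""
    n, d = X.shape
    blocks = []
    for size in range(1, d + 1):
        for s in combinations(range(d), size):
            s = list(s)
            # Entry (i, k) is True when X[i, s] >= knots[k, s] coordinatewise.
            ind = np.all(X[:, None, s] >= knots[None, :, s], axis=2)
            blocks.append(ind.astype(float))
    return np.hstack(blocks)  # shape (n, n_knots * (2**d - 1))


def pc_ha_ridge_fit(X, y, n_components, ridge):
    """Assumed PC-HA-style fit: ridge regression on the leading principal
    components of the HAL design (a sketch, not the paper's estimator)."""
    Phi = hal_indicator_basis(X, knots=X)
    # Right singular vectors of Phi are eigenvectors of Phi^T Phi.
    _, S, Vt = np.linalg.svd(Phi, full_matrices=False)
    E = Vt[:n_components].T            # (N, m) principal directions
    Z = Phi @ E                        # (n, m) PC scores, orthogonal columns
    y_mean = y.mean()
    # Z^T Z = diag(S^2), so the ridge solution is available coordinatewise.
    alpha = (Z.T @ (y - y_mean)) / (S[:n_components] ** 2 + ridge)
    beta = E @ alpha                   # coefficients in the original basis
    return {"intercept": y_mean, "alpha": alpha, "beta": beta,
            "E": E, "knots": X}


def pc_ha_predict(model, X_new):
    """Evaluate the fitted function at new points."""
    Phi_new = hal_indicator_basis(X_new, model["knots"])
    return model["intercept"] + Phi_new @ model["beta"]
```

In this sketch the outcome is centered by its sample mean instead of fitting an unpenalized intercept jointly, and the product `E @ alpha` plays the role of a map from PC coefficients back to coefficients on the original indicators. The full basis is still materialized, so the example is only practical for small `n` and `d`.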

Paper Structure

This paper contains 46 sections, 18 theorems, 122 equations, 4 figures, 2 tables.

Key Result

Lemma 1

Let $\{\tilde{\phi}_m : 1 \le m \le m_n\}$ and $\{E_m : 1 \le m \le m_n\}$ be defined as above. (i) Orthogonality with respect to $\langle\cdot,\cdot\rangle_n$: for $m_1 \neq m_2$, […]. Moreover, […], where $\lambda_m$ is the eigenvalue associated with $E_m$, satisfying $I_N(E_m) = \lambda_m E_m$. (ii) Orthonormality with respect to the coefficient inner product: for $f_1, f_2 \in D^{(k)}(\mathcal{E}_{n} \dots$ […]

Figures (4)

  • Figure 1: Convergence rate of cross-validated PC-HAs, measured by the mean MSE across repetitions (linear target function). Each panel shows log-log MSE versus sample size $n$ for a given dimension $d \in \{3,5,10\}$ (rows) and norm choice $\{L_1,L_2,\text{sectional variation}\}$ (columns). The black line is the empirical convergence slope, while the red dashed line is the theoretical reference with exponent $-2/3$.
  • Figure 2: Convergence rate of cross-validated PC-HAs for the "fast" sinusoidal target function. Note that the slopes are much closer to $-1$.
  • Figure 3: Scaling behavior of different norms and metrics as a function of sample size $n$. Each row corresponds to a different metric: (1) $\|\alpha_n\|_2$ vs $n$, (2) $\|\alpha_n\|_1$ vs $n$, (3) $\|\alpha_n\|_\infty$ vs $n$, (4) $\|\beta(\alpha_n)\|_1$ vs $n$, and (5) $J_n$ (number of selected coefficients) vs $n$. Each column corresponds to a different regularization method (PC-HAGL, PC-HAL, PC-HAR). All plots are log-log.
  • Figure 4: One panel per PC-HA estimator (sectional variation, $L_2$, $L_1$): empirical mean of the efficient influence curve (EIC) versus the outcome regularization parameter $\lambda$. The red vertical line marks the cross-validated choice of $\lambda$ for the outcome fit. The blue curve is the plug-in bias (empirical mean of the EIC) for each value of $\lambda$; the horizontal blue lines correspond to the threshold $\tau$ defined in the main text, so that undersmoothing selects the smallest $\lambda$ for which the blue curve lies within $\pm\tau$.
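
Two of the captions above describe small computations that can be sketched in code: the empirical convergence slope on the log-log scale (Figures 1 and 2) and the undersmoothing rule for $\lambda$ (Figure 4). The sketch below assumes the slope is obtained by ordinary least squares of $\log(\text{MSE})$ on $\log n$, which is one common choice and is not confirmed by the source. It also reads the selection rule literally as "smallest $\lambda$ on the grid whose absolute empirical mean of the EIC is at most $\tau$". The function names `loglog_slope` and `undersmoothed_lambda` are hypothetical.

```python
import numpy as np


def loglog_slope(sample_sizes, mse):
    """Least-squares slope of log(MSE) regressed on log(n)."""
    slope, _intercept = np.polyfit(np.log(sample_sizes), np.log(mse), deg=1)
    return slope


def undersmoothed_lambda(lambdas, eic_means, tau):
    """Smallest lambda on the grid with |empirical mean of the EIC| <= tau.

    Returns None when no grid value satisfies the threshold.
    """
    lambdas = np.asarray(lambdas, dtype=float)
    within = np.abs(np.asarray(eic_means, dtype=float)) <= tau
    return float(lambdas[within].min()) if within.any() else None
```

A fitted slope near $-2/3$ would match the red reference line in Figure 1, while a value near $-1$ would match the behavior reported for Figure 2.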

Theorems & Definitions (25)

  • Example 1
  • Example 2: Log-likelihood loss with exponential link
  • Example 3
  • Example 4
  • Definition 1: Coefficient map from PC basis to original basis
  • Lemma 1
  • Lemma 2: Relationships between PC models and the HAL space
  • Theorem 1: Convergence rate of PC-HA estimators
  • Corollary 1
  • Theorem 2
  • ...and 15 more