Highly Adaptive Empirical Risk Minimization with Principal Components

Carlos García Meixide; Mingxun Wang; Alejandro Schuler; Mark J. van der Laan

Abstract

The Highly Adaptive Lasso (HAL) delivers unprecedented guarantees in nonparametric minimum loss estimation under minimal smoothness assumptions, including dimension-free minimax optimal rates. However, its practical use has been severely limited by its indicator basis expansion, which grows exponentially with the dimension and becomes computationally prohibitive in moderate to high dimensions. Existing screening strategies drastically reduce the size of this basis but lack theoretical justification. We introduce the Principal Component Highly Adaptive (PC-HA) family of estimators, which for the first time provides a principled and theoretically valid dimension reduction. We establish formal results on the score equations solved by these PC-HA estimators, which allow plug-in efficiency and pointwise asymptotic normality results to be transferred from HAL to the PC-HA estimators under comparable complexity control.
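To make the dimension problem concrete, the following is a minimal sketch, not the paper's construction. It assumes the standard zero-order HAL basis (one indicator per nonempty coordinate subset and per observed knot), an uncentered SVD as the principal component step, and a ridge penalty on the PC coefficients. The function names `hal_indicator_basis`, `pc_reduce` and `fit_pc_ridge` are hypothetical.

```python
import itertools
import numpy as np

def hal_indicator_basis(X, max_degree=None):
    """Zero-order HAL design: phi_{s,j}(x) = prod_{l in s} 1{x_l >= X_{j,l}}
    for every nonempty coordinate subset s and every observation j (assumed form)."""
    n, d = X.shape
    max_degree = d if max_degree is None else max_degree
    blocks = []
    for size in range(1, max_degree + 1):
        for s in itertools.combinations(range(d), size):
            s = list(s)
            # Compare each row with each knot on the coordinates in s: (n, n, |s|)
            ind = (X[:, None, s] >= X[None, :, s]).all(axis=2)
            blocks.append(ind.astype(float))
    return np.hstack(blocks)  # n x n(2^d - 1) columns when max_degree = d

def pc_reduce(H, k):
    """Top-k principal component scores and loadings of H (no centering: an assumption)."""
    U, S, Vt = np.linalg.svd(H, full_matrices=False)
    return U[:, :k] * S[:k], Vt[:k].T

def fit_pc_ridge(H, y, k, lam):
    """Ridge regression on k PC scores; beta maps the fit back to the original basis."""
    Z, V = pc_reduce(H, k)
    alpha = np.linalg.solve(Z.T @ Z + lam * np.eye(k), Z.T @ y)
    return alpha, V @ alpha

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=20)
H = hal_indicator_basis(X)           # 20 rows, 20 * (2**3 - 1) columns
alpha, beta = fit_pc_ridge(H, y, k=5, lam=0.1)
print(H.shape, alpha.shape, beta.shape)
```

In this sketch the number of columns is n(2^d - 1), while the regression itself is solved in only k dimensions; the vector `beta` expresses the same fit in the original indicator basis.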

Paper Structure

This paper contains 46 sections, 18 theorems, 122 equations, 4 figures, 2 tables.

Introduction
Statistical estimation problem
Sectional variation norm, smoothness classes and their spline representation
Approximation results and covering numbers for smoothness classes
Finite dimensional linear starting working models
Theoretical properties of HAL
Screeners for HAL and discrete super-learning
Historical context and literature review
A novel fast-to-compute PC-screener that preserves HAL theory
The PC-working model
PC Highly Adaptive estimators: PC-HAGL, PC-HAL and PC-HAR
The PC dimension reduction working model $D^{(k)}({\cal E}_n)$
Definitions of PC-HA estimators
Computational and statistical comparisons of PC-HAGL, PC-HAL, PC-HAR
Convergence rates
...and 31 more sections

Key Result

Lemma 1

Let $\{\tilde{\phi}_m : 1 \le m \le m_n\}$ and $\{E_m : 1 \le m \le m_n\}$ be defined as above.

(i) Orthogonality with respect to $\langle\cdot,\cdot\rangle_n$. For $m_1 \neq m_2$, [...]. Moreover, [...], where $\lambda_m$ is the eigenvalue associated with $E_m$, satisfying $I_N(E_m) = \lambda_m E_m$.

(ii) Orthonormality with respect to the coefficient inner product. For $f_1, f_2 \in D^{(k)}({\cal E}_n)$ [...]
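The two kinds of orthogonality named in the lemma can be illustrated numerically with a generic principal component decomposition. This is only a sketch under assumptions not taken from the source: the empirical inner product is taken to be $\langle f, g\rangle_n = \frac{1}{n}\sum_i f(X_i) g(X_i)$, the PC features are the uncentered SVD scores of an indicator design matrix, and the paper's normalization of $\tilde{\phi}_m$ and $\lambda_m$ may differ. The helper names are hypothetical.

```python
import numpy as np

def pc_scores(H):
    """Uncentered SVD of the design H; columns of the first output are PC features
    evaluated at the n observations (assumed construction)."""
    U, S, Vt = np.linalg.svd(H, full_matrices=False)
    return U * S, S, Vt

def empirical_gram(Z):
    """Matrix of assumed empirical inner products <z_a, z_b>_n = (1/n) sum_i z_a(i) z_b(i)."""
    return Z.T @ Z / Z.shape[0]

rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))
# Full-interaction indicator block only, as a small stand-in for a HAL design
H = (X[:, None, :] >= X[None, :, :]).all(axis=2).astype(float)

Z, S, Vt = pc_scores(H)
G = empirical_gram(Z)
off_diag = G - np.diag(np.diag(G))

print(np.abs(off_diag).max())                      # numerically zero
print(np.allclose(np.diag(G), S**2 / 30))          # diagonal set by the spectrum
print(np.allclose(Vt @ Vt.T, np.eye(len(S))))      # loading vectors are orthonormal
```

Distinct PC features are orthogonal under the empirical inner product, with squared norms fixed by the singular values, while the loading vectors are orthonormal in the Euclidean inner product on coefficients.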

Figures (4)

Figure 1: Convergence rate of cross-validated PC-HAs using the mean MSE across repetitions (linear target function). Each panel shows log--log MSE versus sample size $n$ for a given dimension $d \in \{3,5,10\}$ (rows) and norm choice $\{L_1,L_2,\text{sectional variation}\}$ (columns). The black line is the empirical convergence slope, while the red dashed line is the theoretical reference with exponent $-2/3$.
Figure 2: Convergence rate of cross-validated PC-HAs for the "fast" sinusoidal target function. Note that the slopes are much closer to $-1$.
Figure 3: Scaling behavior of different norms and metrics as a function of sample size $n$. Each row corresponds to a different metric: (1) $\|\alpha_n\|_2$ vs $n$, (2) $\|\alpha_n\|_1$ vs $n$, (3) $\|\alpha_n\|_\infty$ vs $n$, (4) $\|\beta(\alpha_n)\|_1$ vs $n$, and (5) $J_n$ (number of selected coefficients) vs $n$. Each column corresponds to a different regularization method (PC-HAGL, PC-HAL, PC-HAR). All plots are log-log.
Figure 4: One panel per PC-HA estimator (sectional variation, $L_2$, $L_1$): empirical mean of the efficient influence curve (EIC) versus the outcome regularization parameter $\lambda$. The red vertical line marks the cross-validated choice of $\lambda$ for the outcome fit. The blue curve is the plug-in bias (empirical mean of the EIC) for each value of $\lambda$; the horizontal blue lines correspond to the threshold $\tau$ defined in the main text (Section \ref{sec:ate}), so that undersmoothing selects the smallest $\lambda$ for which the blue curve lies within $\pm\tau$.
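Two quantities in these captions are simple to compute, and the sketch below shows one way to do so. It assumes the empirical slope in Figures 1 and 2 is a least-squares fit of log MSE on log $n$, and that the undersmoothing rule in Figure 4 is applied over a finite grid of $\lambda$ values; the function names and the toy numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def loglog_slope(n_values, mse_values):
    """Least-squares slope of log(MSE) against log(n) (assumed definition of the empirical slope)."""
    slope, _intercept = np.polyfit(np.log(n_values), np.log(mse_values), deg=1)
    return slope

def undersmooth_lambda(lambdas, eic_means, tau):
    """Smallest lambda on the grid whose empirical EIC mean lies within +/- tau.
    Returns None when no grid point satisfies the criterion."""
    lambdas = np.asarray(lambdas, dtype=float)
    eic_means = np.asarray(eic_means, dtype=float)
    ok = np.abs(eic_means) <= tau
    return float(lambdas[ok].min()) if ok.any() else None

# Toy inputs, made up for illustration only
n_values = np.array([100, 200, 400, 800])
mse_values = 2.0 * n_values ** (-2.0 / 3.0)      # exact n^(-2/3) decay
print(loglog_slope(n_values, mse_values))

lambdas = [0.01, 0.1, 1.0]
eic_means = [0.5, 0.02, -0.01]
print(undersmooth_lambda(lambdas, eic_means, tau=0.05))
```

On the toy inputs, the fitted slope recovers the exponent $-2/3$ used to generate the data, and the selection rule returns the smallest grid value whose EIC mean falls inside the band.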

Theorems & Definitions (25)

Example 1
Example 2: Log-likelihood loss with exponential link
Example 3
Example 4
Definition 1: Coefficient map from PC basis to original basis
Lemma 1
Lemma 2: Relationships between PC models and the HAL space
Theorem 1: Convergence rate of PC-HA estimators
Corollary 1
Theorem 2
...and 15 more
