How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?

Kuo-Wei Lai; Guanghui Wang; Molei Tao; Vidya Muthukumar

TL;DR

This work characterizes the implicit bias of GD for training a shallow ReLU model with the squared loss on high-dimensional random features. It shows that, for sufficiently high-dimensional random data, the implicit bias approximates the minimum-l2-norm solution with high probability, with a gap on the order of $\Theta(\sqrt{n/d})$, where n is the number of training examples and d is the feature dimension.
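To get a feel for the $\sqrt{n/d}$ scaling, the rate can be evaluated at the two experimental settings quoted in the figure captions ($n=10$ with $d=50$, and $n=10$ with $d=2000$). This is only the scaling factor: the constants hidden by $\Theta(\cdot)$ are not stated in this summary, so the numbers below are not the actual gaps.

$$
\sqrt{\frac{n}{d}}\,\Big|_{n=10,\,d=50} = \sqrt{0.2} \approx 0.447,
\qquad
\sqrt{\frac{n}{d}}\,\Big|_{n=10,\,d=2000} = \sqrt{0.005} \approx 0.071 .
$$

The first setting, where $d$ is comparable with $n$, leaves the factor far from zero, whereas the second lies in the regime where the approximation is expected to be tight.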

Abstract

Overparameterized ML models, including neural networks, typically induce underdetermined training objectives with multiple global minima. The implicit bias refers to the limiting global minimum that is attained by a common optimization algorithm, such as gradient descent (GD). In this paper, we characterize the implicit bias of GD for training a shallow ReLU model with the squared loss on high-dimensional random features. Prior work showed that the implicit bias does not exist in the worst case (Vardi and Shamir, 2021), or corresponds exactly to the minimum-l2-norm solution among all global minima under exactly orthogonal data (Boursier et al., 2022). Our work interpolates between these two extremes and shows that, for sufficiently high-dimensional random data, the implicit bias approximates the minimum-l2-norm solution with high probability, with a gap on the order of $\Theta(\sqrt{n/d})$, where n is the number of training examples and d is the feature dimension. Our results are obtained through a novel primal-dual analysis, which carefully tracks the evolution of predictions, data-span coefficients, as well as their interactions, and shows that the ReLU activation pattern quickly stabilizes with high probability over the random data.
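The primal-dual view mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single ReLU model with loss $\frac{1}{2}\sum_i (\max\{\boldsymbol{x}_i^\top\boldsymbol{w},0\}-y_i)^2$ and assumes the data-span coefficients take the form $\boldsymbol{w}^{(t)} = \boldsymbol{w}^{(0)} + \boldsymbol{X}^\top\boldsymbol{\alpha}^{(t)}$; the paper's exact definitions may differ. The helper names (`run_gd`, `alpha`, `patterns`) and the problem sizes are hypothetical, and plain Python is used only to keep the sketch dependency-free.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def run_gd(X, y, w0, eta, steps):
    """GD on 0.5 * sum_i (relu(x_i . w) - y_i)^2, tracking primal and dual variables."""
    n, d = len(X), len(w0)
    w = list(w0)
    alpha = [0.0] * n   # assumed data-span coefficients: w = w0 + X^T alpha
    patterns = []       # ReLU activation pattern recorded at every step
    for _ in range(steps):
        z = [dot(x, w) for x in X]          # pre-activations x_i . w
        active = [zi > 0.0 for zi in z]
        patterns.append(tuple(active))
        # Residual on active examples, zero on inactive ones (ReLU gradient mask).
        g = [(zi - yi) if a else 0.0 for zi, yi, a in zip(z, y, active)]
        for i in range(n):
            alpha[i] -= eta * g[i]          # dual (coefficient) update
        for j in range(d):
            w[j] -= eta * sum(g[i] * X[i][j] for i in range(n))  # primal update
    return w, alpha, patterns

rng = random.Random(0)
n, d = 5, 200                               # illustrative sizes, not from the paper
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
w0 = [rng.gauss(0.0, 1e-3) for _ in range(d)]

w, alpha, patterns = run_gd(X, y, w0, eta=1e-3, steps=200)

# Rebuild w from the dual variables and compare with the primal iterate.
recon = [w0[j] + sum(alpha[i] * X[i][j] for i in range(n)) for j in range(d)]
span_gap = max(abs(a - b) for a, b in zip(w, recon))
print("max |w - (w0 + X^T alpha)| =", span_gap)
print("distinct activation patterns seen:", len(set(patterns)))
```

The reconstruction check holds for any data because each GD step moves $\boldsymbol{w}$ along a combination of the rows of $\boldsymbol{X}$. How quickly the recorded activation pattern stops changing depends on the data, initialization, and step size, so the sketch prints it rather than asserting anything about it.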

Paper Structure (55 sections, 19 theorems, 133 equations, 7 figures, 1 table)

Introduction
Our contributions:
Our techniques in a nutshell:
Related Work
Notation:
Problem Setup
General ReLU Models and Empirical Risk Minimization.
Gradient Descent and Primal-dual Representation.
Minimum-l2-norm Solution.
Implicit Bias of Single ReLU Models (m=1) Under Gradient Descent
Gradient Descent Updates and Convergence
Sufficient Conditions for Gradient Descent Convergence
Minimum-l2-norm Solution of Single ReLU models
High-dimensional Implicit Bias of Single ReLU Models
Approximation to Minimum-l2-norm Solution in High Dimensions
...and 40 more sections

Key Result

Lemma 1

Suppose there exists $t_0 \ge 0$ such that $\boldsymbol{D}(\boldsymbol{X}\boldsymbol{w}^{(t_0)})=\boldsymbol{D}(\boldsymbol{X}\boldsymbol{w}^{(t)})$ for all $t \geq t_0$. Define the subset of examples $S \coloneqq \{ i\in [n] : \boldsymbol{x}_i^\top\boldsymbol{w}^{(t_0)} > 0\}$. Then, for all $t \ge

Figures (7)

Figure 1: Gradient descent transition diagram for the $k$-th neuron.
Figure 2: Approximation error between the implicit bias of the single ReLU model $\boldsymbol{w}^{(\infty)}$ and the minimum-$\ell_2$-norm solution $\boldsymbol{w}^\star$.
Figure 3: We illustrate the prediction dynamics of gradient descent for a single ReLU model under different random initializations when $d$ is comparable with $n$. In both cases, with sufficiently small step size, the final solution converges to a linear minimum-$\ell_2$-norm interpolator on some subset of the training examples, i.e. of the form $\boldsymbol{w}_{\mathrm{linear-MNI},S} = \boldsymbol{X}_S^\top(\boldsymbol{X}_S\boldsymbol{X}_S^\top)^{-1}\tilde{\boldsymbol{y}}_S$, where $\tilde{y}_{S,i}=\max\{y_i,0\}$. In contrast to the high-dimensional regime, different initializations lead to different subsets $S$, indicating that ReLU training implicitly performs an initialization-dependent example "selection" process rather than fitting all positively-labeled samples. The experiment uses $n=10$, $d=50$, $\boldsymbol{x}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$, $y\sim\mathcal{N}(0,1)$, $\boldsymbol{w}^{(0)}\sim\mathcal{N}(\boldsymbol{0},2\times10^{-6}\boldsymbol{I})$, and $\eta= 10^{-4}$.
Figure 4: Simulation illustrating Theorem \ref{thm:multiple_relu_gd_high_dim_implicit_bias}. In the high-dimensional regime and under our "all-positive" initialization, after the first gradient step, examples with positive labels remain active while examples with negative labels become inactive, consistent with Lemma \ref{lem:primal_label_same_sign_two}. The blue region shows primal variables that remain positive over training, whereas the red region corresponds to dual variables that are sufficiently negative and remain unchanged. As training proceeds, $\boldsymbol{w}_{\oplus}$ fits all positively labeled examples and $\boldsymbol{w}_{\ominus}$ fits all negatively labeled examples. The experiment uses $n=10$, $d=2000$, features $\boldsymbol{x}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$, and labels satisfying $|y|\sim\mathcal{U}(0.1,1)$ with $\mathrm{sign}(y)$ uniformly distributed over $\{\pm 1\}$.
Figure 5: Simulation with random initialization in the high-dimensional regime, which violates our initialization assumption in Theorem \ref{thm:multiple_relu_gd_high_dim_implicit_bias}. Under random initialization, the sufficient conditions of Lemma \ref{lem:primal_label_same_sign_two} are violated at the first gradient step. As a result, positively labeled examples do not all remain in the active (blue) regime (e.g. example no. 5), nor do negatively labeled examples consistently enter the inactive (red) regime (e.g. example no. 7). Consequently, during training, this model fails to converge to a global minimum. The experiment uses $n=10$, $d=2000$, features $\boldsymbol{x}\sim\mathcal{N}(\boldsymbol{0},\boldsymbol{I})$, and labels satisfying $|y|\sim\mathcal{U}(0.1,1)$ with $\mathrm{sign}(y)$ uniformly distributed over $\{\pm 1\}$.
...and 2 more figures
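
The linear minimum-$\ell_2$-norm interpolator from the Figure 3 caption, $\boldsymbol{X}_S^\top(\boldsymbol{X}_S\boldsymbol{X}_S^\top)^{-1}\tilde{\boldsymbol{y}}_S$ with $\tilde{y}_{S,i}=\max\{y_i,0\}$, can be computed directly once a subset $S$ is fixed. The sketch below is a minimal illustration, not the authors' code: the subset `S` is chosen by hand (in the paper it is determined by the training dynamics), the helper names `solve` and `linear_mni` are hypothetical, and a small Gaussian-elimination routine stands in for a linear-algebra library to keep the example dependency-free.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A a = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]      # augmented matrix [A | b]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]                        # move largest pivot up
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    a = [0.0] * k
    for r in range(k - 1, -1, -1):                     # back substitution
        a[r] = (M[r][k] - sum(M[r][j] * a[j] for j in range(r + 1, k))) / M[r][r]
    return a

def linear_mni(X, y, S):
    """Return X_S^T (X_S X_S^T)^{-1} y_tilde_S with y_tilde_i = max(y_i, 0)."""
    XS = [X[i] for i in S]
    y_tilde = [max(y[i], 0.0) for i in S]
    gram = [[dot(u, v) for v in XS] for u in XS]       # X_S X_S^T
    coeffs = solve(gram, y_tilde)
    d = len(X[0])
    return [sum(coeffs[r] * XS[r][j] for r in range(len(S))) for j in range(d)]

rng = random.Random(1)
n, d = 10, 50                                          # sizes quoted in the Figure 3 caption
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
S = [0, 3, 7]                                          # hypothetical subset, for illustration only

w_mni = linear_mni(X, y, S)

# The interpolator should reproduce the clipped labels on S.
fit_err = max(abs(dot(X[i], w_mni) - max(y[i], 0.0)) for i in S)
print("max interpolation error on S:", fit_err)
```

Solving the $|S|\times|S|$ Gram system rather than inverting a $d \times d$ matrix keeps the cost small when $|S| \ll d$, which is the regime the captions describe.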

Theorems & Definitions (36)

Lemma 1
Lemma 2
Lemma 3
Theorem 1
Remark 1
Theorem 2
Lemma 4
Theorem 3
Theorem 4
Lemma 5
...and 26 more
