The effect of Leaky ReLUs on the training and generalization of overparameterized networks

Yinglong Guo; Shaohan Li; Gilad Lerman

TL;DR

The paper analyzes training and generalization of overparameterized neural networks using a broad class of Leaky ReLU activations σ_α. By extending neural tangent kernel–style analyses to Leaky ReLUs, it derives training-convergence bounds that depend on α and shows that α = -1 (the absolute value activation) yields optimal convergence, with corresponding benefits for generalization in settings with large data and early stopping. The authors also introduce a rescaled activation trick to simplify proofs, improve gradient bounds across all layers, and demonstrate that deeper networks can achieve similar convergence rates to shallow ones. Numerical experiments on synthetic data, F-MNIST, and CIFAR-10 corroborate the theory, showing fastest training and best early generalization for α = -1, and provide practical guidance for activation choice in overparameterized regimes. The work highlights the potential practical and theoretical advantages of using the absolute value activation in regression and classification tasks when training deep, wide networks.

Abstract

We investigate the training and generalization errors of overparameterized neural networks (NNs) with a wide class of leaky rectified linear unit (ReLU) functions. More specifically, we carefully upper bound both the convergence rate of the training error and the generalization error of such NNs and investigate the dependence of these bounds on the Leaky ReLU parameter, $α$. We show that $α=-1$, which corresponds to the absolute value activation function, is optimal for the training error bound. Furthermore, in special settings, it is also optimal for the generalization error bound. Numerical experiments empirically support the practical choices guided by the theory.
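The correspondence between $α=-1$ and the absolute value function can be checked directly. The sketch below assumes the standard Leaky ReLU form, $σ_α(x) = x$ for $x \geq 0$ and $σ_α(x) = αx$ for $x < 0$; the paper's exact normalization (for example, its rescaled activation) is not reproduced in this summary, and the function name `leaky_relu` is illustrative only.

```python
import numpy as np

def leaky_relu(x, alpha):
    """Standard Leaky ReLU (assumed form): x for x >= 0, alpha * x for x < 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, x, alpha * x)

x = np.linspace(-2.0, 2.0, 9)

# alpha = -1 maps negative inputs to -x, which is the absolute value.
abs_like = leaky_relu(x, alpha=-1.0)

# alpha = 0 zeroes out negative inputs, which is the ordinary ReLU.
relu_like = leaky_relu(x, alpha=0.0)
```

Under this assumed form, `abs_like` agrees with `np.abs(x)` and `relu_like` agrees with `np.maximum(x, 0)`, so ReLU and the absolute value are two members of the same one-parameter family.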

Paper Structure (28 sections, 25 theorems, 336 equations, 6 figures, 8 tables, 3 algorithms)

INTRODUCTION
PROBLEM SETUP
MAIN RESULTS
IDEAS OF PROOFS
Proof Sketch
Discussion of Innovation
NUMERICAL EXPERIMENTS
Setup
Results
DISCUSSION
Discussion of the Generalization Error Bound
Proofs
Notation
Initialization
Perturbation
...and 13 more sections

Key Result

Theorem 3.1

Assume the setup of the Problem Setup section, where both $m/\ln^4 m > \frac{1+\alpha^2}{(1-\alpha)^2}\,\Omega\left(\frac{n^5 L^{15} d}{\delta^4}\right)$ and $m > \Omega\left(\ln\ln\epsilon^{-1}\right)$, and the training follows the paper's gradient descent training algorithm with learning rate $\eta \leq O\left(\frac{d}{nL^2 m}\right)$. Then, where
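The $\alpha$-dependent factor $\frac{1+\alpha^2}{(1-\alpha)^2}$ in the width condition above can be examined numerically. The sketch below is only an illustration, not the paper's argument: the grid of $\alpha$ values is an assumption (the admissible range is not stated in this summary), and `width_factor` is a hypothetical helper name.

```python
import numpy as np

def width_factor(alpha):
    """alpha-dependent factor (1 + alpha^2) / (1 - alpha)^2 from the width condition."""
    alpha = np.asarray(alpha, dtype=float)
    return (1.0 + alpha**2) / (1.0 - alpha) ** 2

# Assumed grid: step 0.01 on [-5, 0.9]; alpha = 1 is excluded because the factor blows up there.
alphas = np.linspace(-5.0, 0.9, 591)
factors = width_factor(alphas)

# Grid point where the factor is smallest.
best_alpha = alphas[np.argmin(factors)]

factor_abs = width_factor(-1.0)   # absolute value activation
factor_relu = width_factor(0.0)   # ordinary ReLU
```

On this grid the minimizer is $\alpha = -1$, where the factor equals $1/2$, compared with $1$ for ReLU ($\alpha = 0$). This agrees with the derivative $\frac{2(\alpha+1)}{(1-\alpha)^3}$, which is negative for $\alpha < -1$ and positive for $-1 < \alpha < 1$.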

Figures (6)

Figure 1: Log-scale training and testing errors using different datasets and different $\alpha$'s. From left to right: synthetic dataset, F-MNIST and CIFAR-10. Top row: training errors. Bottom row: testing errors.
Figure 2: Comparison of the "shape" of the theoretical upper bound of the training convergence rate (orange line) with the calculated convergence rate (blue dots). We used the synthetic dataset (left) and the California housing dataset (right) with different values of $\alpha$.
Figure 3: Log-scale training and testing errors using different datasets and different $\alpha$'s. Left: cross entropy errors for MNIST; Right: MSE for California housing. Top row: training errors. Bottom row: testing errors.
Figure 4: Log-scale training and testing errors using different datasets and different $\alpha$'s. Left: binary entropy errors for IMDB; Right: negative log likelihood errors for Transformer on MNIST. Top row: training errors. Bottom row: testing errors.
Figure 5: Log-scale errors on F-MNIST with different $\alpha$'s. Left: training errors at the last epoch with $L=2$ and different widths ($m$s); Right: testing errors at the epoch $t=300$ with $m=5000$ and different depths ($L$s).
...and 1 more figure

Theorems & Definitions (41)

Theorem 3.1
Theorem 3.2
Corollary 3.3
Theorem 3.4
Lemma 4.1: Semi-smoothness
Lemma 4.2: Gradient bounds
Lemma 4.3: Generalization error with perturbation
Lemma B.1
proof
Lemma B.2
...and 31 more
