The effect of Leaky ReLUs on the training and generalization of overparameterized networks

Yinglong Guo, Shaohan Li, Gilad Lerman

TL;DR

The paper analyzes training and generalization of overparameterized neural networks using a broad class of Leaky ReLU activations σ_α. By extending neural tangent kernel–style analyses to Leaky ReLUs, it derives training-convergence bounds that depend on α and shows that α = -1 (the absolute value activation) yields optimal convergence, with corresponding benefits for generalization in settings with large data and early stopping. The authors also introduce a rescaled activation trick to simplify proofs, improve gradient bounds across all layers, and demonstrate that deeper networks can achieve similar convergence rates to shallow ones. Numerical experiments on synthetic data, F-MNIST, and CIFAR-10 corroborate the theory, showing fastest training and best early generalization for α = -1, and provide practical guidance for activation choice in overparameterized regimes. The work highlights the potential practical and theoretical advantages of using the absolute value activation in regression and classification tasks when training deep, wide networks.

Abstract

We investigate the training and generalization errors of overparameterized neural networks (NNs) with a wide class of leaky rectified linear unit (ReLU) functions. More specifically, we carefully upper bound both the convergence rate of the training error and the generalization error of such NNs and investigate the dependence of these bounds on the Leaky ReLU parameter, $\alpha$. We show that $\alpha=-1$, which corresponds to the absolute value activation function, is optimal for the training error bound. Furthermore, in special settings, it is also optimal for the generalization error bound. Numerical experiments empirically support the practical choices guided by the theory.
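
As a minimal sketch of the activation family discussed above, the snippet below assumes the standard unscaled Leaky ReLU, $\sigma_\alpha(x) = x$ for $x \ge 0$ and $\alpha x$ otherwise. The function name `leaky_relu` and the sample points are illustrative choices, and the rescaled variant the paper uses in its proofs is not reproduced here. It only checks that $\alpha = -1$ coincides with the absolute value.

```python
def leaky_relu(x: float, alpha: float) -> float:
    """Unscaled Leaky ReLU (assumed form): x for x >= 0, alpha * x otherwise."""
    return x if x >= 0 else alpha * x


# A few hypothetical sample points, including negatives and zero.
samples = [-2.0, -0.5, 0.0, 1.5]

# alpha = 0 gives the ordinary ReLU: negative inputs map to zero.
relu_values = [leaky_relu(x, 0.0) for x in samples]

# alpha = -1 flips the sign of negative inputs, which is exactly |x|.
abs_values = [leaky_relu(x, -1.0) for x in samples]

# The two lists agree element by element.
matches_abs = abs_values == [abs(x) for x in samples]
```

Under this definition, $\alpha = 1$ would make the activation the identity, so the network would be linear; the interesting range is $\alpha \neq 1$.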

Paper Structure

This paper contains 28 sections, 25 theorems, 336 equations, 6 figures, 8 tables, 3 algorithms.

Key Result

Theorem 3.1

Assume the setup of §sec:problem_setup, where both $m/\ln^4 m > \frac{1+\alpha^2}{(1-\alpha)^2}\,\Omega\left(\frac{n^5 L^{15} d}{\delta^4}\right)$ and $m > \Omega\left(\ln\ln\epsilon^{-1}\right)$, and the training follows Algorithm alg:training_gd with learning rate $\eta \leq O\left(\frac{d}{nL^2 m}\right)$. Then, [...] where [...]
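
The only place $\alpha$ enters the width condition above is the factor $\frac{1+\alpha^2}{(1-\alpha)^2}$. The sketch below evaluates that factor on a grid to show where it is smallest. The helper name `width_factor` and the grid range are hypothetical, and the constants hidden inside $\Omega(\cdot)$ are ignored, so this illustrates the shape of the dependence rather than the theorem's full bound.

```python
def width_factor(alpha: float) -> float:
    """Alpha-dependent factor (1 + alpha^2) / (1 - alpha)^2 from the width condition."""
    if alpha == 1.0:
        # At alpha = 1 the denominator vanishes, so the factor is undefined.
        raise ValueError("width factor is undefined at alpha = 1")
    return (1.0 + alpha**2) / (1.0 - alpha) ** 2


# Hypothetical grid: alpha from -3.00 to 0.99 in steps of 0.01.
alphas = [i / 100 for i in range(-300, 100)]

# Grid point with the smallest factor, i.e. the mildest width requirement.
best_alpha = min(alphas, key=width_factor)

factor_abs = width_factor(-1.0)   # absolute value activation
factor_relu = width_factor(0.0)   # ordinary ReLU
```

On this grid the minimizer is $\alpha = -1$, where the factor equals $1/2$, compared with $1$ for the ordinary ReLU. This agrees with a direct calculation: the derivative of the factor is $\frac{2(1+\alpha)}{(1-\alpha)^3}$, which vanishes only at $\alpha = -1$ for $\alpha < 1$.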

Figures (6)

  • Figure 1: Log-scale training and testing errors using different datasets and different $\alpha$'s. From left to right: synthetic dataset, F-MNIST and CIFAR-10. Top row: training errors. Bottom row: testing errors.
  • Figure 2: Comparison of the "shape" of the theoretical upper bound of the training convergence rate (orange line) with the calculated convergence rate (blue dots). We used the synthetic dataset (left) and the California housing dataset (right) with different values of $\alpha$.
  • Figure 3: Log-scale training and testing errors using different datasets and different $\alpha$'s. Left: cross entropy errors for MNIST; Right: MSE for California housing. Top row: training errors. Bottom row: testing errors.
  • Figure 4: Log-scale training and testing errors using different datasets and different $\alpha$'s. Left: binary entropy errors for IMDB; Right: negative log likelihood errors for Transformer on MNIST. Top row: training errors. Bottom row: testing errors.
  • Figure 5: Log-scale errors on F-MNIST with different $\alpha$'s. Left: training errors at the last epoch with $L=2$ and different widths ($m$s); Right: testing errors at the epoch $t=300$ with $m=5000$ and different depths ($L$s).
  • ...and 1 more figure

Theorems & Definitions (41)

  • Theorem 3.1
  • Theorem 3.2
  • Corollary 3.3
  • Theorem 3.4
  • Lemma 4.1: Semi-smoothness
  • Lemma 4.2: Gradient bounds
  • Lemma 4.3: Generalization error with perturbation
  • Lemma B.1
  • proof
  • Lemma B.2
  • ...and 31 more