The effect of Leaky ReLUs on the training and generalization of overparameterized networks
Yinglong Guo, Shaohan Li, Gilad Lerman
TL;DR
The paper analyzes the training and generalization of overparameterized neural networks with a broad class of Leaky ReLU activations σ_α. Extending neural tangent kernel–style analyses to Leaky ReLUs, it derives training-convergence bounds that depend on α and shows that α = -1 (the absolute value activation) is optimal for the training error bound, with corresponding benefits for generalization in settings with large data and early stopping. The authors also introduce a rescaled-activation trick that simplifies the proofs, improve the gradient bounds across all layers, and show that deeper networks can achieve convergence rates similar to those of shallow ones. Numerical experiments on synthetic data, F-MNIST, and CIFAR-10 corroborate the theory: α = -1 gives the fastest training and the best early generalization. These results offer practical guidance for choosing an activation in overparameterized regimes and point to practical and theoretical advantages of the absolute value activation in regression and classification tasks with deep, wide networks.
Abstract
We investigate the training and generalization errors of overparameterized neural networks (NNs) with a wide class of leaky rectified linear unit (ReLU) functions. More specifically, we carefully upper bound both the convergence rate of the training error and the generalization error of such NNs and investigate the dependence of these bounds on the Leaky ReLU parameter, $\alpha$. We show that $\alpha=-1$, which corresponds to the absolute value activation function, is optimal for the training error bound. Furthermore, in special settings, it is also optimal for the generalization error bound. Numerical experiments empirically support the practical choices guided by the theory.
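The correspondence between $\alpha=-1$ and the absolute value can be checked with a minimal sketch. It assumes the common Leaky ReLU convention, in which the input is kept for nonnegative values and multiplied by the slope for negative values; that convention is not spelled out in this summary, and the paper's rescaled variant is not reproduced here. The function name `leaky_relu` and the sample points are illustrative only.

```python
def leaky_relu(x: float, alpha: float) -> float:
    """Leaky ReLU under the assumed convention: x for x >= 0, alpha * x for x < 0."""
    return x if x >= 0 else alpha * x


# Hypothetical sample points chosen for illustration.
samples = [-2.0, -0.5, 0.0, 0.5, 2.0]

# alpha = -1 flips the sign of negative inputs, which gives |x|.
abs_like = [leaky_relu(x, -1.0) for x in samples]

# alpha = 0 zeroes out negative inputs, which gives the standard ReLU.
relu_like = [leaky_relu(x, 0.0) for x in samples]

# alpha = 1 leaves negative inputs unchanged, which gives the identity map.
identity_like = [leaky_relu(x, 1.0) for x in samples]

assert abs_like == [abs(x) for x in samples]
assert relu_like == [max(x, 0.0) for x in samples]
assert identity_like == samples
```

Under this convention a single parameter sweeps through the identity, the standard ReLU, and the absolute value, which is why the bounds can be studied as a function of $\alpha$ alone.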
