Global Minimizers of $\ell^p$-Regularized Objectives Yield the Sparsest ReLU Neural Networks
Julia Nakhleh, Robert D. Nowak
TL;DR
This work addresses the problem of finding the sparsest ReLU interpolant for given data by introducing a continuous, almost-everywhere differentiable objective based on the $\ell^p$ quasinorm with $0<p<1$, whose global minima correspond to the sparsest single-hidden-layer networks. A variational reformulation recasts the problem as optimizing over continuous piecewise-linear functions with respect to a $p$-variation cost $V_p(f)$ (with $V_0(f)$ counting knots), enabling a gradient-based approach to sparse interpolation. The authors establish univariate results showing uniqueness and explicit sparsity bounds, and extend to multivariate inputs by proving that sufficiently small $p$ yields sparsest solutions with width-invariant neuron counts and $O(N)$ active parameters; a finite-dimensional activation-pattern reformulation underpins these results. Experiments with reweighted $\ell^1$ regularization corroborate the theoretical claims, demonstrating faster and sparser interpolation than standard regularization schemes. Overall, the paper provides a principled continuous route to recovering sparse ReLU networks without pruning, with broad implications for efficiency and interpretability.
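As a rough illustration of the variational cost in the univariate case, the sketch below represents a continuous piecewise-linear function by its list of per-piece slopes and computes the knot count $V_0$ together with a $p$-variation. The formula used here for $V_p$ (the sum of absolute slope changes, each raised to the power $p$) is an assumption made for illustration, since the summary above does not give the paper's exact definition; the function names are hypothetical.

```python
import numpy as np

def slope_changes(slopes):
    """Differences between consecutive per-piece slopes of a univariate CPWL function."""
    return np.diff(np.asarray(slopes, dtype=float))

def v0(slopes, tol=1e-12):
    """Knot count: number of breakpoints where the slope actually changes."""
    return int(np.sum(np.abs(slope_changes(slopes)) > tol))

def vp(slopes, p):
    """Assumed p-variation: sum of |slope change|^p over the knots (0 < p <= 1)."""
    c = np.abs(slope_changes(slopes))
    c = c[c > 0]  # zero slope changes are not knots and contribute nothing
    return float(np.sum(c ** p))

# Slopes on four consecutive pieces; the middle breakpoint is not a true knot.
slopes = [0.0, 1.0, 1.0, -1.0]
knots = v0(slopes)            # slope changes are [1, 0, -2], so two knots
variation = vp(slopes, 0.5)   # 1**0.5 + 2**0.5
```

Under this assumed definition, $V_p$ with $p = 1$ reduces to the total variation of the derivative.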
Abstract
Overparameterized neural networks can interpolate a given dataset in many different ways, prompting the fundamental question: which among these solutions should we prefer, and what explicit regularization strategies will provably yield these solutions? This paper addresses the challenge of finding the sparsest interpolating ReLU network (i.e., the network with the fewest nonzero parameters or neurons), a goal with wide-ranging implications for efficiency, generalization, interpretability, theory, and model compression. Unlike post hoc pruning approaches, we propose a continuous, almost-everywhere differentiable training objective whose global minima are guaranteed to correspond to the sparsest single-hidden-layer ReLU networks that fit the data. This result marks a conceptual advance: it recasts the combinatorial problem of sparse interpolation as a continuous optimization task, potentially enabling the use of gradient-based training methods. Our objective is based on minimizing $\ell^p$ quasinorms of the weights for $0 < p < 1$, a classical sparsity-promoting strategy in finite-dimensional settings. However, applying these ideas to neural networks presents new challenges: the function class is infinite-dimensional, and the weights are learned using a highly nonconvex objective. We prove that, under our formulation, global minimizers correspond exactly to sparsest solutions. Our work lays a foundation for understanding when and how continuous sparsity-inducing objectives can be leveraged to recover sparse networks through training.
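To make the sparsity-promoting effect of $\ell^p$ quasinorms concrete, the minimal sketch below compares a sparse and a dense vector that have the same $\ell^1$ norm. It is a finite-dimensional toy example only, not the paper's network objective; the helper name `lp_penalty` and the two example vectors are chosen here for illustration.

```python
import numpy as np

def lp_penalty(w, p):
    """Return sum_i |w_i|^p, the p-th power of the l^p (quasi)norm of w."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(np.abs(w) ** p))

# Two vectors with identical l^1 norm (both sum to 1 in absolute value).
sparse = np.array([1.0, 0.0, 0.0, 0.0])
dense = np.array([0.25, 0.25, 0.25, 0.25])

l1_sparse = lp_penalty(sparse, 1.0)    # 1.0
l1_dense = lp_penalty(dense, 1.0)      # 1.0: l^1 cannot tell them apart
half_sparse = lp_penalty(sparse, 0.5)  # 1.0
half_dense = lp_penalty(dense, 0.5)    # 4 * sqrt(0.25) = 2.0: dense is penalized more
```

With $p = 1$ the two vectors are tied, whereas with $p = 0.5$ the dense vector costs twice as much, which is the sense in which $p < 1$ favors solutions with fewer nonzero entries.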
