Penalising the biases in norm regularisation enforces sparsity

Etienne Boursier; Nicolas Flammarion

TL;DR

The paper investigates how regularising the parameters' norm in a univariate one-hidden-layer ReLU network with a skip connection relates to the function the network represents. It shows that, when the skip connection is left unpenalised, the representational cost of a function is the weighted total variation of its second derivative, $\|\sqrt{1+x^2}\,f''\|_{\mathrm{TV}}$, whereas omitting bias penalisation yields the unweighted $\|f''\|_{\mathrm{TV}}$; this weighting drives uniqueness and sparsity of the minimal-norm interpolator. A dynamic-programming reformulation over slopes shows that the minimiser is unique, and under mild assumptions on the data this minimiser is among the sparsest interpolators (fewest kinks). Experiments corroborate the theory: penalising the biases yields markedly sparser estimators than leaving them unpenalised, illustrating how regularising the biases, explicitly or implicitly, shapes the learned function and may benefit generalisation.

Abstract

Controlling the parameters' norm often yields good generalisation when training neural networks. Beyond simple intuitions, the relation between regularising parameters' norm and obtained estimators remains theoretically misunderstood. For one hidden ReLU layer networks with unidimensional data, this work shows the parameters' norm required to represent a function is given by the total variation of its second derivative, weighted by a $\sqrt{1+x^2}$ factor. Notably, this weighting factor disappears when the norm of bias terms is not regularised. The presence of this additional weighting factor is of utmost significance as it is shown to enforce the uniqueness and sparsity (in the number of kinks) of the minimal norm interpolator. Conversely, omitting the bias' norm allows for non-sparse solutions. Penalising the bias terms in the regularisation, either explicitly or implicitly, thus leads to sparse estimators.
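
A short worked inequality may help show where the $\sqrt{1+x^2}$ weight comes from. It is only a sketch under an assumption not stated in the abstract: each hidden neuron $x \mapsto a\,\sigma(wx+b)$, with $\sigma$ the ReLU and $w \neq 0$, contributes $\tfrac12(a^2+w^2+b^2)$ to the penalty. The symbols $a$, $w$, $b$, $\lambda$ and $\tau$ are introduced here for illustration.

```latex
% Assumption: the neuron x -> a\,\sigma(wx+b) is penalised by (a^2 + w^2 + b^2)/2.
% The ReLU is positively homogeneous, so (a, w, b) -> (a/\lambda, \lambda w, \lambda b),
% \lambda > 0, leaves the function unchanged; optimising over \lambda makes AM-GM tight.
\begin{align*}
\tfrac12\left(a^2 + w^2 + b^2\right)
  &\ge |a|\,\sqrt{w^2 + b^2} \\
  &= |a w|\,\sqrt{1 + \tau^2}, \qquad \tau = -\frac{b}{w} \ \text{(position of the kink)}.
\end{align*}
% If b is not penalised, the same argument gives \tfrac12(a^2 + w^2) \ge |a w|,
% which no longer depends on \tau.
```

Here $|aw|$ is the magnitude of the slope change the neuron creates at its kink, so with the bias penalised each kink is charged in proportion to $\sqrt{1+\tau^2}$, and without it each kink is charged only by its slope change.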

Paper Structure (24 sections, 14 theorems, 153 equations, 6 figures)

Introduction
Contributions.
Infinite width networks
Unpenalised skip connection
Representational cost
Computing minimal norm interpolator
Properties of minimal norm interpolator
Recovering a sparsest interpolator
Application to classification
Experiments
Conclusion
Discussing \ref{ass:slopes}
Additional experiments
Proofs of \ref{sec:norm}
Proof of \ref{sec:compute}
...and 9 more sections

Key Result

Theorem 1

For any Lipschitz function $f:\mathbb{R}\to\mathbb{R}$, $\overline{R}(f) = \|\sqrt{1+x^2}\,f''\|_{\mathrm{TV}}$. For any non-Lipschitz function, $\overline{R}(f) = \infty$.
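
For a piecewise-linear function, $f''$ is a sum of point masses at the kinks, so the weighted total variation $\|\sqrt{1+x^2}\,f''\|_{\mathrm{TV}}$ from the TL;DR reduces to a finite sum. The following is a minimal sketch of that computation, not the paper's code; the function name `kink_costs` and its arguments are hypothetical.

```python
import math


def kink_costs(kinks, slope_changes):
    """Return (weighted, unweighted) total variation of f'' for a
    piecewise-linear f whose slope jumps by slope_changes[j] at kinks[j].

    Hypothetical helper for illustration: f'' is a sum of Dirac masses, so
    weighted   = sum_j |c_j| * sqrt(1 + x_j^2)
    unweighted = sum_j |c_j|
    """
    weighted = sum(
        abs(c) * math.sqrt(1.0 + x * x) for x, c in zip(kinks, slope_changes)
    )
    unweighted = sum(abs(c) for c in slope_changes)
    return weighted, unweighted


# One kink at the origin versus the same slope change moved to x = 0.75.
at_origin = kink_costs([0.0], [2.0])
shifted = kink_costs([0.75], [2.0])

# The same total slope change spread over two kinks.
split = kink_costs([-0.75, 0.75], [1.0, 1.0])
```

In this toy comparison the unweighted cost is 2 in all three cases, while the weighted cost grows as kinks move away from the origin. This is only meant to illustrate that the unweighted cost ignores kink positions; it makes no claim about which of these functions interpolates a given data set.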

Figures (6)

Figure 1: Recursive definition of the dynamic program for $i\geq i_0$.
Figure 2: Partition given by $(n_k)_k$ on a toy example.
Figure 3: Final estimator when training one-hidden layer network with $\ell_2$ regularisation. The green dots correspond to the data and the green line is the estimated function. Each blue star represents a hidden neuron $(w_j,b_j)$ of the network: its $x$-axis value is given by $-b_j/w_{j}$, which coincides with the position of the kink of its associated ReLU; its $y$-axis value is given by the output weight $a_j$.
Figure 4: Case of difference between minimal norm interpolator and sparsest interpolator.
Figure 5: Minimiser of \ref{eq:mininterpolator1} on a toy data example.
...and 1 more figures
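
The caption of Figure 3 describes how each hidden neuron $(w_j, b_j)$ is plotted: at $x = -b_j/w_j$, the kink of its ReLU, with height equal to its output weight $a_j$. A minimal sketch of that mapping follows, assuming the network parameters are available as plain Python lists; the helper name `neuron_kinks` is hypothetical and not taken from the paper.

```python
def neuron_kinks(w, b, a):
    """Map hidden neurons a_j * relu(w_j * x + b_j) to (kink position, output weight).

    Hypothetical helper: neurons with w_j == 0 are constant in x and have
    no kink, so they are skipped.
    """
    points = []
    for w_j, b_j, a_j in zip(w, b, a):
        if w_j == 0.0:
            continue  # no kink: the pre-activation does not depend on x
        points.append((-b_j / w_j, a_j))
    return points


# Three toy neurons; the last one has zero input weight and is dropped.
stars = neuron_kinks([2.0, -1.0, 0.0], [1.0, 3.0, 5.0], [0.5, -2.0, 1.0])
```

Counting the distinct kink positions returned this way with non-zero output weight gives one rough measure of the sparsity of a trained network in the sense used above (number of kinks).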

Theorems & Definitions (26)

Theorem 1
Example 1
Lemma 1
Lemma 2
Remark 1
Remark 2
Theorem 2
Lemma 3
Lemma 4
Theorem 3
...and 16 more
