Sufficient Conditions for Stability of Minimum-Norm Interpolating Deep ReLU Networks

Ouns El Harzli, Yoonsoo Nam, Ilja Kuzborskij, Bernardo Cuenca Grau, Ard A. Louis

TL;DR

This paper studies the algorithmic stability of deep homogeneous ReLU networks that achieve zero training error with parameters of the smallest $L_2$ norm (minimum-norm interpolation), a regime that can be observed in overparameterized models trained by gradient-based algorithms.

Abstract

Algorithmic stability is a classical framework for analyzing the generalization error of learning algorithms. It predicts that an algorithm has small generalization error if it is insensitive to small perturbations in the training set such as the removal or replacement of a training point. While stability has been demonstrated for numerous well-known algorithms, this framework has had limited success in analyses of deep neural networks. In this paper we study the algorithmic stability of deep ReLU homogeneous neural networks that achieve zero training error using parameters with the smallest $L_2$ norm, also known as the minimum-norm interpolation, a phenomenon that can be observed in overparameterized models trained by gradient-based algorithms. We investigate sufficient conditions for such networks to be stable. We find that 1) such networks are stable when they contain a (possibly small) stable sub-network, followed by a layer with a low-rank weight matrix, and 2) such networks are not guaranteed to be stable even when they contain a stable sub-network, if the following layer is not low-rank. The low-rank assumption is inspired by recent empirical and theoretical results which demonstrate that training deep neural networks is biased towards low-rank weight matrices, for minimum-norm interpolation and weight-decay regularization.
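
For reference, the classical replace-one notion of uniform stability that this framework builds on can be written as below. This is the standard textbook formulation, not necessarily the exact definition used in the paper; the symbols $A$, $S$, $S^{(i)}$, $\ell$, $\gamma$, $R$ and $\hat{R}_S$ are notation assumed here for illustration.

$$
\sup_{z} \left| \ell\big(A(S), z\big) - \ell\big(A(S^{(i)}), z\big) \right| \leq \gamma
\quad \text{for all } S, S^{(i)} \text{ differing only in the } i\text{-th point}
$$

$$
\Longrightarrow \quad \left| \mathbb{E}_{S}\left[ R\big(A(S)\big) - \hat{R}_S\big(A(S)\big) \right] \right| \leq \gamma .
$$

Here $A(S)$ is the predictor returned by the algorithm on a training set $S$ of $n$ i.i.d. points, $S^{(i)}$ is $S$ with its $i$-th point replaced by an independent draw from the same distribution, $\ell$ is the loss, $R$ is the population risk and $\hat{R}_S$ is the empirical risk on $S$. A stability parameter $\gamma$ that shrinks with $n$ therefore gives an expected generalization gap that shrinks with $n$.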

Paper Structure

This paper contains 43 sections, 5 theorems, 55 equations, and 6 figures.

Key Result

theorem 6

Suppose that datasets are $B$-admissible according to Assumption \ref{asm:data}. Let $\epsilon = M / n^{\alpha}$ for some $M \geq 0$ and $\alpha > 0$, and let $a > 0$. Consider $L \geq L^*$ and $1 \leq k \leq L-1$ that satisfy \eqref{eq:low stable rank}. Suppose that the sub-network is $\beta$-uniformly $\epsilon$-st

Figures (6)

  • Figure 1: A diagram of our main results. Our main results revolve around three conditions: a) the data is expressible by a neural network with bounded weight matrix norm (Assumption \ref{asm:data}), b) the minimum-norm interpolating ReLU neural network contains at least one layer whose weight matrix has low stable rank (see Assumption \ref{existence low stable rank}), and c) the sub-network is stable (Hypothesis \ref{asm:stable-subnetwork}). In this scenario, we show in Theorem \ref{stable sub network with stable rank} that the full network is also stable. If b) does not hold (no low-rank layer), the full network may be unstable, even with a stable sub-network (\ref{unstable network high rank}).
  • Figure 2: Stability of sub-networks and stable rank of the layers. We trained an $8$-layer FCN on a uniformly drawn sample of $10^4$ MNIST images by minimizing a mean squared error (MSE) loss to near zero, labeling the first $5$ classes as $-1$ and the others as $1$. We performed multiple trials; all trials use identical initialization, and a different portion of the training set is replaced in each trial. The error bars show $1$ standard deviation across trials. Using these models, we measured the sign stability (left), i.e. $|\mathrm{sign}(f_k(\mathbf{x}; \, \hat{\theta})) -\mathrm{sign}(f_k(\mathbf{x}; \, \hat{\theta}^{(i)}))|$, the stability (middle), and the stable rank of the weight matrix (right) for each sub-network $f_k$ with $2 \leq k \leq 7$. The horizontal dotted lines show the (sign) stability of the full network. For the details of the experiment, a link to our code, and additional experiments on Fashion-MNIST, see \ref{app:experiments}.
  • Figure 3: Stability as a function of the number of data points. We followed the same setting as in \ref{fig:stability} while varying the training set size. Both the sign stability (left) and the stability (middle) of the sub-networks (blue and orange) decay at a rate similar to that of the full network (green). The stable rank of the weight matrices (right) also decreases as a function of $n$, suggesting that \ref{existence low stable rank} holds in the large-$n$ limit. Observe that the slopes for the sub-networks and the full network are similar, which validates that the respective stabilities have the same dependence on $n$ (\ref{stable sub network with stable rank}).
  • Figure 4: Stability of sub-networks and stable rank of the layers (Fashion-MNIST). We repeat the experiment of \ref{fig:stability}, but on the Fashion-MNIST dataset. In agreement with the MNIST experiment, the FCN contains a stable sub-network and low-rank layers.
  • Figure 5: Stability as a function of the number of data points (Fashion-MNIST). We repeat the experiment of \ref{fig:stability_by_n}, but on the Fashion-MNIST dataset. As $n$ increases, the stability of the sub-network $f_6$ (orange) is similar to the stability of the whole network (green), and the stable rank of $W_6$ also converges to $1$.
  • ...and 1 more figure
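
The two quantities reported in the figure captions above, the stable rank of a weight matrix and the sign stability of a (sub-)network, can be sketched in a few lines. This is a minimal illustration and not the authors' code: it assumes the usual definition of stable rank, $\|W\|_F^2 / \|W\|_2^2$, estimates the spectral norm by power iteration, and averages the sign-stability expression from the Figure 2 caption over a batch of test points. The function names `stable_rank` and `sign_stability` and the averaging step are assumptions made for this example.

```python
import math
import random


def stable_rank(W, iters=200, seed=0):
    """Return ||W||_F^2 / ||W||_2^2 for a matrix given as a list of rows.

    Assumed definition (the usual one); the spectral norm is estimated by
    power iteration on W^T W, so the result is an approximation.
    """
    n_rows, n_cols = len(W), len(W[0])
    frob_sq = sum(v * v for row in W for v in row)
    if frob_sq == 0.0:
        raise ValueError("stable rank is undefined for the zero matrix")

    # Random start vector; the seed is fixed only for reproducibility.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    for _ in range(iters):
        u = [sum(w * x for w, x in zip(row, v)) for row in W]  # u = W v
        z = [sum(W[i][j] * u[i] for i in range(n_rows))        # z = W^T u
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in z))
        v = [x / norm for x in z]                               # unit vector

    # Rayleigh quotient with unit v: ||W v||^2 approximates sigma_max^2.
    u = [sum(w * x for w, x in zip(row, v)) for row in W]
    spec_sq = sum(x * x for x in u)
    return frob_sq / spec_sq


def sign(x):
    """Sign with sign(0) = 0."""
    return (x > 0) - (x < 0)


def sign_stability(outputs, outputs_perturbed):
    """Mean of |sign(f(x; theta)) - sign(f(x; theta_i))| over test points.

    `outputs` come from the model trained on the original set and
    `outputs_perturbed` from the model trained on the perturbed set.
    Averaging over points is an assumption of this sketch.
    """
    diffs = [abs(sign(a) - sign(b)) for a, b in zip(outputs, outputs_perturbed)]
    return sum(diffs) / len(diffs)
```

With these definitions, a $3 \times 3$ identity matrix has stable rank $3$ and any rank-one matrix has stable rank $1$, which is the value the caption of Figure 5 reports $W_6$ converging to. A sign stability of $0$ means the two models agree in sign on every test point.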

Theorems & Definitions (9)

  • definition 1: Sub-network
  • theorem 6
  • lemma 1
  • lemma 2
  • proof: Proof of \ref{stable sub network with stable rank}
  • lemma 3
  • proof: Proof of \ref{unstable network high rank}
  • lemma 4: timor2023implicit
  • proof: Proof of \ref{lem:min-norm-sol-toy}