Do Neural Networks Need Gradient Descent to Generalize? A Theoretical Study

Yotam Alexander, Yonatan Slutzky, Yuval Ran-Milo, Nadav Cohen

TL;DR

The paper theoretically investigates whether gradient descent is necessary for generalization in overparameterized neural networks by focusing on matrix factorization with linear and non-linear activations. It proves a width-driven failure of the volume/G&C approach: as width grows, G&C generalization can become no better than random, indicating that gradient descent is needed in these wide regimes. Conversely, it shows depth-driven success: with linear activations, RIP, and a Gaussian prior with normalization, increasing depth makes G&C generalization arbitrarily good (and provably near-perfect for rank-1 ground truth), while empirical results corroborate depth helping G&C and width hurting. Together, the results reveal a nuanced, width-vs-depth landscape for generalization under gradient-descent-like versus Guess & Check dynamics and motivate further theory beyond matrix factorization.

Abstract

Conventional wisdom attributes the mysterious generalization abilities of overparameterized neural networks to gradient descent (and its variants). The recent volume hypothesis challenges this view: it posits that these generalization abilities persist even when gradient descent is replaced by Guess & Check (G&C), i.e., by drawing weight settings until one that fits the training data is found. The validity of the volume hypothesis for wide and deep neural networks remains an open question. In this paper, we theoretically investigate this question for matrix factorization (with linear and non-linear activation)--a common testbed in neural network theory. We first prove that generalization under G&C deteriorates with increasing width, establishing what is, to our knowledge, the first case where G&C is provably inferior to gradient descent. Conversely, we prove that generalization under G&C improves with increasing depth, revealing a stark contrast between wide and deep networks, which we further validate empirically. These findings suggest that even in simple settings, there may not be a simple answer to the question of whether neural networks need gradient descent to generalize well.
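To make the Guess & Check procedure from the abstract concrete, here is a minimal sketch for a depth-two linear matrix factorization. Several choices are assumptions made for illustration and are not taken from the paper: the training data are linear measurements $y_i = \langle A_i, W^* \rangle$ with Gaussian $A_i$, the prior is a zero-mean Gaussian with variance 1/fan-in, and the names `train_loss`, `guess_and_check`, `eps`, and `max_draws` are hypothetical.

```python
import numpy as np

def train_loss(W, A, y):
    """Mean squared error of the measurements <A_i, W> against the targets y_i."""
    preds = np.einsum("nij,ij->n", A, W)
    return float(np.mean((preds - y) ** 2))

def guess_and_check(A, y, width, eps, rng, max_draws=100_000):
    """Draw depth-two linear factorizations W2 @ W1 from the prior until the
    training loss drops below eps. Returns (W, number_of_draws), or
    (None, max_draws) if no draw fits the data."""
    _, m, m_prime = A.shape
    for draw in range(1, max_draws + 1):
        # Assumed prior: independent N(0, 1/fan_in) entries in each layer.
        W1 = rng.normal(0.0, 1.0 / np.sqrt(m_prime), size=(width, m_prime))
        W2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=(m, width))
        W = W2 @ W1
        if train_loss(W, A, y) < eps:
            return W, draw
    return None, max_draws

# Tiny demo: 2x2 rank-one ground truth with unit Frobenius norm, two measurements.
rng = np.random.default_rng(0)
m = m_prime = 2
n = 2
W_star = np.outer(rng.standard_normal(m), rng.standard_normal(m_prime))
W_star /= np.linalg.norm(W_star)
A = rng.standard_normal((n, m, m_prime))   # assumed Gaussian measurement matrices
y = np.einsum("nij,ij->n", A, W_star)

W_hat, draws = guess_and_check(A, y, width=2, eps=0.1, rng=rng)
```

The loop never uses gradients: each candidate is an independent draw from the prior, and the training data only decide whether a draw is accepted. Tightening `eps` or enlarging the problem makes the expected number of draws grow quickly, which is why the demo uses a very small instance.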

Paper Structure

This paper contains 42 sections, 58 theorems, 334 equations, 13 figures, 5 tables.

Key Result

Theorem 1

Suppose the activation $\sigma ( \cdot )$ is admissible (def:admissible), and that it is anti-symmetric, meaning $\sigma ( - \alpha ) = - \sigma ( \alpha )$ for all $\alpha \in {\mathbb R}$. Let ${\mathcal{Q}} ( \cdot )$ be a regular probability distribution over ${\mathbb R}$ (def:regular), and let [...] Moreover, in the case where ${\mathcal{Q}} ( \cdot )$ is a zero-centered Gaussian distribution, i.e. [...]

Figures (13)

  • Figure 1: In line with our theory (\ref{sec:analysis:width}), as the width of a matrix factorization increases, the generalization attained by G&C deteriorates, to the point of being no better than chance, i.e., no better than the generalization attained by drawing a single weight setting from the prior distribution while disregarding the training data. In contrast, gradient descent attains good generalization across all widths. Each of the above plots corresponds to a matrix factorization as described in \ref{sec:prelim:mf}, with a different activation $\sigma ( \cdot )$: linear activation ($\sigma ( \alpha ) = \alpha$) for the left plot; tanh activation ($\sigma ( \alpha ) = \tanh ( \alpha )$) for the middle plot; and Leaky ReLU activation ($\sigma ( \alpha ) = \max \{ c \cdot \alpha , \alpha \}$, with $c = 0.2$) for the right plot. In each plot, the generalization loss (\ref{eq:ms_loss_gen_mf}) is shown against the width of the matrix factorization, for three optimizers: gradient descent with small step size and small initialization (\ref{sec:prelim:gd}); G&C with a Kaiming Gaussian prior distribution (\ref{sec:prelim:gnc}); and simply drawing a single weight setting from the prior distribution while disregarding the training data. For each combination of width and optimizer, we report the median (marker) and interquartile range (error bar) of generalization losses attained over eight trials (differing only in random seed). Across all experiments reported in this figure: the matrix factorization had depth two and dimensions $m = m' = 5$; the ground truth matrix had (Frobenius) norm and rank equal to one; and the training data size was $n = 15$. We note that with Leaky ReLU activation, which lies beyond the scope of our theory, the generalization attained by gradient descent is not as good as it is with linear and tanh activations. For further experiments and implementation details see \ref{app:details,app:exper}, respectively.
  • Figure 2: In line with our theory (\ref{sec:analysis:depth}), as the depth of a matrix factorization increases, the generalization attained by G&C improves, drawing closer to that of gradient descent. This figure adheres to the caption of \ref{fig:width}, except for the following differences: (i) the matrix factorization had variable (rather than fixed) depth and fixed (rather than variable) width, with the latter set to five; (ii) generalization losses are shown against the depth (rather than the width) of the factorization; and (iii) the prior distribution of G&C included normalization (\ref{def:generated}). We did not include depths greater than ten in our experiments, as they led to excessively long run times for gradient descent (due to vanishing gradients). Note that such greater depths would not necessarily lead the generalization attained by G&C to match that of gradient descent. Indeed, our theory for increasing depth (\ref{result:depth_gnc}) guarantees that the generalization loss attained by G&C tends to zero only if the threshold set for its training loss (see \ref{sec:prelim:gnc}) tends to zero, which makes G&C computationally infeasible (as it requires an infeasible number of draws). For further experiments and implementation details see \ref{app:details,app:exper}, respectively.
  • Figure 3: In line with our theory (\ref{sec:analysis:width}), as the width of a matrix factorization increases, the generalization attained by G&C deteriorates, to the point of being no better than chance, i.e., no better than the generalization attained by randomly drawing a single weight setting from the prior distribution while disregarding the training data. This figure adheres to the caption of \ref{fig:width}, except that we employ gradient descent with a momentum coefficient of 0.9 (qian1999momentum). For further details see \ref{app:details}.
  • Figure 4: In line with our theory (\ref{sec:analysis:depth}), as the depth of a matrix factorization increases, the generalization attained by G&C improves, drawing closer to that of gradient descent. This figure adheres to the caption of \ref{fig:depth}, except that we employ gradient descent with a momentum coefficient of 0.9 (qian1999momentum). For further details see \ref{app:details}.
  • Figure 5: In line with our theory (\ref{sec:analysis:width}), as the width of a matrix factorization increases, the generalization attained by G&C deteriorates, to the point of being no better than chance, i.e., no better than the generalization attained by randomly drawing a single weight setting from the prior distribution while disregarding the training data. In contrast, gradient descent attains good generalization across all widths. This figure adheres to the caption of \ref{fig:width}, except that the ground truth matrix had rank two and the training data size was $n=22$. For further details see \ref{app:details}.
  • ...and 8 more figures
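
The "chance" baseline named in the Figure 1 caption (a single draw from the prior, ignoring the training data) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the depth-two factorization is taken to be $W_2 \, \sigma ( W_1 )$, the prior is a zero-mean Gaussian with variance 1/fan-in standing in for the Kaiming Gaussian prior, and the generalization loss is taken to be the squared Frobenius distance to the ground truth. The helper names and the widths tried are hypothetical.

```python
import numpy as np

# The three activations listed in the Figure 1 caption.
def linear(a):
    return a

def leaky_relu(a, c=0.2):
    return np.maximum(c * a, a)

ACTIVATIONS = {"linear": linear, "tanh": np.tanh, "leaky_relu": leaky_relu}

def rank_one_ground_truth(m, m_prime, rng):
    """Random rank-one matrix scaled to unit Frobenius norm."""
    W = np.outer(rng.standard_normal(m), rng.standard_normal(m_prime))
    return W / np.linalg.norm(W)

def prior_draw(m, m_prime, width, sigma, rng):
    """One draw of an assumed depth-two factorization W2 @ sigma(W1),
    with independent N(0, 1/fan_in) entries in each layer."""
    W1 = rng.normal(0.0, 1.0 / np.sqrt(m_prime), size=(width, m_prime))
    W2 = rng.normal(0.0, 1.0 / np.sqrt(width), size=(m, width))
    return W2 @ sigma(W1)

def gen_loss(W, W_star):
    """Assumed generalization loss: squared Frobenius distance to the ground truth."""
    return float(np.sum((W - W_star) ** 2))

rng = np.random.default_rng(0)
m = m_prime = 5                      # dimensions from the caption
W_star = rank_one_ground_truth(m, m_prime, rng)

# Median loss over eight prior draws per (activation, width), mirroring the
# eight-trial medians reported in the caption. The widths are arbitrary.
baseline = {}
for name, sigma in ACTIVATIONS.items():
    for width in (5, 50, 500):
        losses = [gen_loss(prior_draw(m, m_prime, width, sigma, rng), W_star)
                  for _ in range(8)]
        baseline[(name, width)] = float(np.median(losses))
```

Because the draws never look at the training data, `baseline` gives a reference level: per the caption, G&C at large width does no better than this.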

Theorems & Definitions (125)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Theorem 1
  • Proof sketch (full proof in \ref{app:width_gnc})
  • Proposition 1: restatement of Theorem 3.3 from soltanolkotabi2023implicit
  • Theorem 2
  • Proof sketch (full proof in \ref{app:depth_gnc})
  • Definition 5
  • ...and 115 more