An analytic theory of generalization dynamics and transfer learning in deep linear networks

Andrew K. Lampinen, Surya Ganguli

TL;DR

This work develops an analytic theory of generalization dynamics for deep linear networks within a teacher-student, low-rank/noise framework. It yields closed-form training and test error expressions as functions of time, data size, initialization, and task structure, revealing that early learning prioritizes high-SNR task structure and can be largely independent of network size. A phase transition in how teacher information imprints onto training data and a non-gradient shrinkage algorithm are derived, along with a theory for transfer learning driven by input-feature alignment. The results extend, qualitatively, to nonlinear networks and deeper architectures, offering insight into task design and transfer strategies that could improve generalization in real-world deep learning.

Abstract

Much attention has been devoted recently to the generalization puzzle in deep learning: large, deep networks can generalize well, but existing theories bounding generalization error are exceedingly loose, and thus cannot explain this striking performance. Furthermore, a major hope is that knowledge may transfer across tasks, so that multi-task learning can improve generalization on individual tasks. However, we lack analytic theories that can quantitatively predict how the degree of knowledge transfer depends on the relationship between the tasks. We develop an analytic theory of the nonlinear dynamics of generalization in deep linear networks, both within and across tasks. In particular, our theory provides analytic solutions to the training and testing error of deep networks as a function of training time, number of examples, network size and initialization, and the task structure and SNR. Our theory reveals that deep networks progressively learn the most important task structure first, so that generalization error at the early stopping time primarily depends on task structure and is independent of network size. This suggests any tight bound on generalization error must take into account task structure, and explains observations about real data being learned faster than random data. Intriguingly, our theory also reveals the existence of a learning algorithm that provably outperforms neural network training through gradient descent. Finally, for transfer learning, our theory reveals that knowledge transfer depends sensitively, but computably, on the SNRs and input feature alignments of pairs of tasks.
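The teacher-student setting described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes whitened inputs (so the training targets reduce to the teacher map plus Gaussian noise), a rank-1 teacher, plain full-batch gradient descent, and raw squared Frobenius errors. The function name `simulate_teacher_student` and all default values (singular value, noise scale, learning rate, step count) are hypothetical choices made for illustration; only the layer sizes echo the figure captions.

```python
import numpy as np


def simulate_teacher_student(n1=100, n2=50, n3=50, s_bar=3.0, noise_std=1.0,
                             lr=0.01, steps=3000, init_scale=1e-3, seed=0):
    """Train a one-hidden-layer linear student on a noisy rank-1 teacher.

    Assumes whitened inputs (X = identity), so fitting the data amounts to
    fitting the noisy teacher matrix. Returns per-step train and test errors.
    """
    rng = np.random.default_rng(seed)

    # Rank-1 teacher: s_bar * u v^T with unit-norm singular vectors.
    u = rng.standard_normal(n3)
    u /= np.linalg.norm(u)
    v = rng.standard_normal(n1)
    v /= np.linalg.norm(v)
    w_teacher = s_bar * np.outer(u, v)

    # Training targets: teacher plus i.i.d. Gaussian noise (assumed scaling).
    w_noisy = w_teacher + (noise_std / np.sqrt(n1)) * rng.standard_normal((n3, n1))

    # Student: small random initialization of both weight matrices.
    w1 = init_scale * rng.standard_normal((n2, n1))
    w2 = init_scale * rng.standard_normal((n3, n2))

    train_err = np.empty(steps)
    test_err = np.empty(steps)
    for step in range(steps):
        w = w2 @ w1
        residual = w_noisy - w
        train_err[step] = np.sum(residual ** 2)          # error on noisy data
        test_err[step] = np.sum((w_teacher - w) ** 2)    # error on clean teacher

        # Gradient descent on 0.5 * ||w_noisy - w2 @ w1||^2.
        # Both updates are computed from the weights before this step.
        step_w2 = residual @ w1.T
        step_w1 = w2.T @ residual
        w2 += lr * step_w2
        w1 += lr * step_w1
    return train_err, test_err


if __name__ == "__main__":
    train_err, test_err = simulate_teacher_student()
    best = int(np.argmin(test_err))
    print("early-stopping step:", best)
    print("test error at early stopping:", test_err[best])
    print("test error at end of training:", test_err[-1])
```

Under these assumptions the training error keeps falling while the test error passes through a minimum and then rises as the student starts fitting noise, which is the early-stopping behaviour the abstract refers to.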

Paper Structure

This paper contains 25 sections, 37 equations, 14 figures.

Figures (14)

  • Figure 1: Learning dynamics as a function of singular dimension strength. (a) shows how modes of different singular value are learned, (b) shows that there is a wave of learning that picks up singular dimensions with smaller and smaller singular values as $t \rightarrow \infty$.
  • Figure 2: The teacher's signal through the noise. Theoretical vs. empirical (a) histogram of singular values of noisy teacher $\hat{s}$. (b) $\hat{s}$ as a function of $\overline{s}$. (c) alignment of noisy teacher and noiseless teacher singular vectors as a function of $\overline{s}$. ($\overline{N_1}= \overline{N_3} = 100$.)
  • Figure 3: Match between theory and experiment for rank 1 (row 1, a-d) and rank 3 (row 2, e-h) teachers with single-hidden-layer students: (a-b, e-f) log train and test error, respectively, showing very close match between theory and experiment for TA, and close match for the random student. (c,g) comparing TA and randomly initialized students minimum generalization errors, showing almost perfect match. (d,h) comparing TA and randomly initialized students optimal stopping times, showing small lag due to alignment. ($N_1 = 100$, $N_2=50$, $N_3 = 50$.)
  • Figure 4: Our theory applies to deeper networks: match between theory and simulation for rank 1 (row 1, a-d) and rank 3 (row 2, e-h) teachers with $n_l=5$ students: (a-b, e-f) log train and test error, respectively, showing very close match between theory and experiment for TA. (c,g) comparing TA and randomly initialized students minimum generalization errors, showing almost perfect match. (d,h) comparing TA and randomly initialized students optimal stopping times, showing large lag due to slower alignment in deeper networks. ($N_1 = 100$, $N_2=50$, $N_3 = 50$.)
  • Figure 5: Train (first row, A-D) and test (second row, E-H) error for nonlinear networks (leaky ReLU at all hidden layers) with one hidden layer (first two columns) or three hidden layers (last two columns) trained on the tasks above, with a rank 1 teacher (first and third columns) or a rank 3 teacher (second and fourth columns). Note that many of the qualitative phenomena observed in linear networks, such as stage-like improvement in the errors, followed by a plateau, followed by overfitting, also appear in nonlinear networks. Compare the first column to Fig. 3AB, the second column to Fig. 3EF, the third to Fig. 4AB, and the fourth to Fig. 4EF. ($N_1 = 100$, $N_2=50$, $N_3 = 50$.)
  • ...and 9 more figures
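
The "wave of learning" in the Figure 1 caption can be sketched numerically. The example below is a minimal illustration, assuming the standard mode dynamics for a single-hidden-layer linear network, $\tau \, da/dt = 2a(s - a)$, whose solution is a sigmoid in time; this excerpt does not state that equation, so treat it as an assumption. The names `mode_strength` and `half_learning_time`, and the defaults for the initial strength `a0` and time constant `tau`, are hypothetical.

```python
import math


def mode_strength(t, s, a0=1e-3, tau=1.0):
    """Strength a(t) of a mode with singular value s, starting from a(0) = a0.

    Closed-form solution of tau * da/dt = 2 * a * (s - a) (assumed dynamics),
    written in a form that does not overflow for large t.
    """
    decay = math.exp(-2.0 * s * t / tau)
    return s / (1.0 + (s / a0 - 1.0) * decay)


def half_learning_time(s, a0=1e-3, tau=1.0):
    """Time at which the mode reaches half of its final strength, a(t) = s / 2."""
    return tau / (2.0 * s) * math.log(s / a0 - 1.0)


if __name__ == "__main__":
    # Modes with larger singular values are picked up earlier.
    for s in (4.0, 2.0, 1.0, 0.5):
        print(f"s = {s}: half-learned at t = {half_learning_time(s):.3f}")
```

Because the half-learning time scales roughly as $1/s$, stronger modes are learned first and progressively weaker ones follow, which matches the ordering the caption describes.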