Mixed Dynamics In Linear Networks: Unifying the Lazy and Active Regimes

Zhenfeng Tu; Santiago Aranguri; Arthur Jacot

TL;DR

This work provides a surprisingly simple unifying formula for the evolution of the learned matrix. The formula contains the lazy and balanced regimes as special cases, as well as a mixed regime in between the two.

Abstract

The training dynamics of linear networks are well studied in two distinct setups: the lazy regime and the balanced/active regime, depending on the initialization and width of the network. We provide a surprisingly simple unifying formula for the evolution of the learned matrix that contains as special cases both the lazy and balanced regimes, as well as a mixed regime in between the two. In the mixed regime, part of the network is lazy while the other part is balanced. More precisely, the network is lazy along singular values that are below a certain threshold and balanced along those that are above the same threshold. At initialization, all singular values are lazy, allowing the network to align itself with the task, so that later in time, when some of the singular values cross the threshold and become active, they converge rapidly (convergence in the balanced regime is notoriously difficult in the absence of alignment). The mixed regime is the 'best of both worlds': it converges from any random initialization (in contrast to balanced dynamics, which require special initialization), and has a low-rank bias (absent in the lazy dynamics). This allows us to prove an almost complete phase diagram of training behavior as a function of the variance at initialization and the width, for an MSE training task.
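The setting described in the abstract can be explored numerically. The following is a minimal sketch, not the authors' code: it trains a two-layer linear network $W_2 W_1$ from i.i.d. $\mathcal{N}(0,\sigma^2)$ weights with plain gradient descent (a discretization of gradient flow) and records the singular values of the product at every step, so they can be compared against the $\sigma^2 w$ threshold mentioned in the Figure 1 caption. The function name, the noiseless loss $\frac{1}{2}\Vert W_2 W_1 - A^{*}\Vert_F^2$, the target construction, and all hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np


def train_two_layer_linear(d=20, w=200, sigma2=1e-4, target_svals=(2.0, 1.0),
                           lr=0.05, steps=2000, seed=0):
    """Gradient descent on 0.5 * ||W2 @ W1 - A_star||_F^2 (illustrative loss).

    All default values are hypothetical; they are not taken from the paper.
    Returns the singular values of W2 @ W1 at every step and sigma2 * w.
    """
    rng = np.random.default_rng(seed)

    # Hypothetical low-rank target with prescribed singular values.
    U, _ = np.linalg.qr(rng.standard_normal((d, d)))
    V, _ = np.linalg.qr(rng.standard_normal((d, d)))
    k = len(target_svals)
    A_star = U[:, :k] @ np.diag(target_svals) @ V[:, :k].T

    # i.i.d. N(0, sigma2) initialization of both layers, hidden width w.
    W1 = np.sqrt(sigma2) * rng.standard_normal((w, d))
    W2 = np.sqrt(sigma2) * rng.standard_normal((d, w))

    threshold = sigma2 * w
    svals_history = np.empty((steps + 1, d))
    svals_history[0] = np.linalg.svd(W2 @ W1, compute_uv=False)

    for step in range(1, steps + 1):
        residual = W2 @ W1 - A_star
        grad_W2 = residual @ W1.T   # gradient of the loss w.r.t. W2
        grad_W1 = W2.T @ residual   # gradient of the loss w.r.t. W1
        W2 = W2 - lr * grad_W2
        W1 = W1 - lr * grad_W1
        svals_history[step] = np.linalg.svd(W2 @ W1, compute_uv=False)

    return svals_history, threshold


svals, threshold = train_two_layer_linear()
# Number of singular values above sigma2 * w at each step.
n_above = (svals > threshold).sum(axis=1)
print("threshold:", threshold)
print("singular values above threshold at start / end:", n_above[0], n_above[-1])
print("largest singular values at the end:", svals[-1, :3])
```

Plotting the columns of `svals` against the step index, with a horizontal line at `threshold`, gives a rough picture of which singular values stay below the threshold and which cross it; whether this small example reproduces the behavior proved in the paper depends on the chosen width and variance.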

Paper Structure (28 sections, 27 theorems, 206 equations, 2 figures)

Introduction
Contributions
Previous Works
Setup
Lazy Dynamics
Balanced Dynamics
Mixed Lazy/Balanced Dynamics
Phase Diagram for MSE
Conclusion
Preliminaries
Convention and Notation
Matrix Inequalities
Perturbation of Singular Values and Singular Vectors
Proof of Theorem \ref{thm:mixed_dynamics}
Weak bound
...and 13 more sections

Key Result

Theorem 1

For a linear net $A_{\theta}=W_{2}W_{1}$ with width $w$, initialized with i.i.d. $\mathcal{N}(0,\sigma^{2})$ weights and trained with Gradient Flow, we have with high probability that for all time $t$,

Figures (2)

Figure 1: For both plots, we train either using gradient descent or the self-consistent dynamics from equation \ref{eq:self-consistent}, with the scaling $\gamma_{\sigma^2}=-1.85$, $\gamma_{w} = 2.25$, which lies in the active regime. (Left panel): We plot train and test error for both dynamics. We observe that the train/test error for gradient descent is very close to the train/test error for the self-consistent dynamics. (Right panel): We plot with a solid line the singular values of $A_{\theta(t)}$ when running the self-consistent dynamics, and use a dashed line for the singular values from running gradient descent. In this experiment, $\text{Rank} A^{*} = 5$. We use different colors for the $5$ largest singular values and the same color for the remaining singular values. We can see how the $5$ largest singular values 'speed up' as they cross the $\sigma^2 w$ threshold, allowing them to converge earlier than the rest. The minimal test error is achieved in the short period where the large singular values have converged but not the rest.
Figure 2: We run GD and plot different quantities as a function of $\gamma_{\sigma^2}$ and $\gamma_w$. Our theoretical results only apply to the top left region for $\gamma_w>1$ and below the red line, although these plots suggest that some results may extend to smaller values of $\gamma_w$. (Top left panel): We plot the smallest test error $\frac{1}{d^2}\Vert A_{\theta(t)}-A^{*}\Vert_F ^{2}$ in the whole run. The active region (below the black line) has a small error while the lazy region does not. (Top right panel): We plot the stable rank of $A_{\theta(t)}$ (defined as $\Vert A_{\theta(t)}\Vert_F ^{2} / \Vert A_{\theta(t)} \Vert_{\text{op}} ^{2}$) at the time of minimal test error. In this experiment, we took $\text{Rank} A^{*} = 5$. We see that the active region has approximately the correct rank while the lazy region overestimates it. (Bottom left panel): We plot the number of iterations until minimal test error, illustrating the trade-off between test error and training time. (Bottom right panel): We compute $\ln \left(\frac{1}{d^2}\Vert A_{\theta(t)}-\hat{A}_{\theta(t)}\Vert_F ^{2}\right)$ where $A_{\theta(t)}$ comes from GD and $\hat{A}_{\theta(t)}$ from the self-consistent dynamics. We observe that this distance is small not only in the region where our theoretical results apply but also almost everywhere outside this region.
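The two quantities defined in the Figure 2 caption, the stable rank $\Vert A\Vert_F^{2} / \Vert A\Vert_{\text{op}}^{2}$ and the normalized test error $\frac{1}{d^2}\Vert A - A^{*}\Vert_F^{2}$, are straightforward to compute. The sketch below is one possible NumPy implementation; the function names are hypothetical, and it assumes $A$ and $A^{*}$ are square $d \times d$ arrays.

```python
import numpy as np


def stable_rank(A):
    """Stable rank ||A||_F^2 / ||A||_op^2, computed from the singular values."""
    s = np.linalg.svd(A, compute_uv=False)  # sorted in descending order
    return float(np.sum(s ** 2) / s[0] ** 2)


def normalized_test_error(A, A_star):
    """(1 / d^2) * ||A - A_star||_F^2, assuming A_star is d x d."""
    d = A_star.shape[0]
    return float(np.linalg.norm(A - A_star, "fro") ** 2 / d ** 2)


# Small example: a matrix with singular values 2, 1, 0, 0.
A_example = np.diag([2.0, 1.0, 0.0, 0.0])
# Stable rank is (4 + 1) / 4, noticeably below the algebraic rank of 2.
print(stable_rank(A_example))
print(normalized_test_error(A_example, np.zeros((4, 4))))
```

The stable rank never exceeds the algebraic rank and is dominated by the largest singular values, so small residual singular values change it only slightly.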

Theorems & Definitions (59)

Remark
Theorem 1
proof
Remark
Theorem 2
Remark
Lemma A.1
proof
Lemma A.2
proof
...and 49 more
