Approaching Deep Learning through the Spectral Dynamics of Weights

David Yunis; Kumar Kshitij Patel; Samuel Wheeler; Pedro Savarese; Gal Vardi; Karen Livescu; Michael Maire; Matthew R. Walter

TL;DR

The paper investigates neural networks through the spectral dynamics of their weights and reports a consistent bias toward low effective rank across architectures and tasks. It finds that weight decay strengthens this low-rank bias beyond its role as a norm regularizer, connects rank minimization to grokking and generalization, and relates the top singular directions to lottery tickets and linear mode connectivity. Across ConvNets, UNets, LSTMs, and Transformers, the authors document that top singular vectors stabilize early in training and that neighboring layers show some alignment in their top ranks, even in nonlinear settings. The findings suggest a unifying empirical framework with possible implications for regularization, model compression, and robust optimization, and they invite a deeper theoretical understanding of spectral dynamics in deep learning.

Abstract

We propose an empirical approach centered on the spectral dynamics of weights (the behavior of singular values and vectors during optimization) to unify and clarify several phenomena in deep learning. We identify a consistent bias in optimization across various experiments, from small-scale "grokking" to large-scale tasks like image classification with ConvNets, image generation with UNets, speech recognition with LSTMs, and language modeling with Transformers. We also demonstrate that weight decay enhances this bias beyond its role as a norm regularizer, even in practical systems. Moreover, we show that these spectral dynamics distinguish memorizing networks from generalizing ones, offering a novel perspective on this longstanding conundrum. Additionally, we leverage spectral dynamics to explore the emergence of well-performing sparse subnetworks (lottery tickets) and the structure of the loss surface through linear mode connectivity. Our findings suggest that spectral dynamics provide a coherent framework to better understand the behavior of neural networks across diverse settings.
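As a rough illustration of what "spectral dynamics" means in practice, the following minimal sketch takes the SVD of a weight matrix at two checkpoints and compares the singular values and the top singular vectors. It is not the authors' code. The helper names `spectrum` and `top_vector_overlap` are hypothetical, the "later checkpoint" is synthetic (the top singular value is simply scaled by 3), and the overlap score is an illustrative stand-in rather than a metric taken from the paper.

```python
import numpy as np


def spectrum(weight):
    """Return (U, s, Vt); NumPy sorts the singular values s in descending order."""
    return np.linalg.svd(weight, full_matrices=False)


def top_vector_overlap(w_early, w_late, k=1):
    """Absolute cosine similarity between matching top-k left singular vectors."""
    u_early, _, _ = spectrum(w_early)
    u_late, _, _ = spectrum(w_late)
    # Singular vectors are only defined up to sign, so compare absolute values.
    return np.abs(np.sum(u_early[:, :k] * u_late[:, :k], axis=0))


rng = np.random.default_rng(0)
w0 = rng.normal(size=(8, 6))  # stand-in for an early checkpoint

# Synthetic later checkpoint (assumption): same singular vectors,
# but the largest singular value has grown disproportionately.
u, s, vt = spectrum(w0)
s_late = s.copy()
s_late[0] *= 3.0
w1 = (u * s_late) @ vt

overlap = top_vector_overlap(w0, w1, k=2)
print("top singular values, early:", s[:3])
print("top singular values, late: ", spectrum(w1)[1][:3])
print("top-2 vector overlap:", overlap)
```

In a real run, `w0` and `w1` would be the same parameter read from two saved checkpoints, and an overlap near 1 for the top indices would indicate that those directions have stopped rotating.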

Paper Structure (33 sections, 7 equations, 14 figures)

Introduction
Related Work
Grokking
Singular Value Dynamics
Low-Rank Properties
Grokking and Rank Minimization
Spectral Dynamics Across Tasks
Methodology
Effective Rank Minimization
Alignment of Singular Vectors Between Layers
The Effect of Weight Decay
Spectral Dynamics with Random Labels
Beyond Generalization
Top Singular Vectors Become Stable Earlier
Lottery Tickets Preserve Final Top Singular Vectors
...and 18 more sections

Figures (14)

Figure 1: Left: Schematic for the spectral dynamics of a weight matrix. As training proceeds top singular vectors become stable and top singular values grow disproportionately large. Right: Singular value evolution for a single matrix in a Transformer, where each line is a single singular value and color represents rank. We see a disproportionate trend where large singular values grow larger faster. We explore these spectral dynamics of weights and connect them to generalization, regularization, and seemingly unrelated phenomena like linear mode connectivity.
Figure 2: Grokking and Spectral Dynamics. Top row: 30% data and no weight decay. 2nd row: 30% data and weight decay 1.0 (grokking), using hyperparameters from \citep{nanda2023progress}. 3rd row: 70% data with no weight decay (slingshot), using hyperparameters from \citep{thilak2022slingshot}. Bottom row: 90% data and no weight decay. 2nd column: Singular value evolution is visualized for the first attention parameter, where each line represents a single singular value and the color represents the rank. 4th column: Alignment (Eqn. \ref{eqn:alignment-matrix}) between the embedding and the first attention parameter is also visualized, where the y-axis corresponds to index $i$ of the diagonal. 3rd column: One can see that grokking co-occurs with low-rank weights (effective rank is Eqn. \ref{eqn:normalized-effective-rank}). In addition, there is an alignment that begins early in training and evolves up the diagonal. Without weight decay and with less data, neither grokking nor the other phenomena occur during the entire training budget, but using more data, even without weight decay, leads to low-rank solutions from the beginning of training. The slingshot case follows a similar trend, though the validation loss is gradually fit. Across cases with good generalization, parameters are lower rank, and alignment is also more prevalent in the top ranks.
Figure 3: Top row: Singular value evolution for a single matrix in the middle of each model. Each line represents a singular value, whereas color represents rank. Notice the unequal evolution where top singular values grow at a disproportionate rate. Bottom row: Normalized effective rank (Eqn. \ref{eqn:normalized-effective-rank}) evolution visualized in color for different matrices across architectures and time. As we move down the $y$-axis, the depth of the parameters in the model increases, while the $x$-axis tracks training time. Notice decreasing effective rank across nearly all parameters, though the magnitude differs across layers. The block-like patterns in the VGG case are likely due to different channel dimension sizes. The banding in the UNet, LSTM, and Transformer cases is due to the differences between convolutional and linear layers, residual block connections, and attention and fully connected layers, respectively. The sharp transition midway through training in the VGG case is likely due to a 10$\times$ learning rate decay.
Figure 4: Top row: Training losses for all tasks. Bottom row: Validation losses for all tasks. Red is the full model. Blue is post-training pruning the bottom half of the SVD for every matrix in the model that is not the final layer. Green is post-training pruning the top half of the SVD. Notice that for all models, keeping the top half of the SVD is close to the full model performance, supporting the idea that the top directions provide a better approximation to the function.
Figure 5: Neighboring layer alignment of singular vectors. Top row: The diagonal of the alignment matrix $A(t)_{ii}$ (Eqn. \ref{eqn:alignment-matrix}) vs. training time for a single pair of matrices in the middle of each model. We see a small amount of alignment in the top ranks between layers shortly after training begins, but this becomes more diffuse over time. Bottom row: Alignment metric (Eqn. \ref{eqn:alignment-measure}) for pairs of matrices for depth vs. training time. It is hard to make out a global trend across models, though the LSTM shows a weak signal around Epoch 1 when the initial alignment occurs, and the Transformer case has a banding pattern with depth due to alignment between the query and key matrices that have no nonlinearity in between.
...and 9 more figures
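
Two quantities named in the captions above, the normalized effective rank (Figure 3) and pruning half of the SVD (Figure 4), can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation. `normalized_effective_rank` assumes an entropy-based effective rank (the exponential of the entropy of the normalized singular values) divided by the number of singular values, which may differ from the paper's equation. `prune_spectrum` and its `keep` argument are hypothetical names; `keep="top"` corresponds to pruning the bottom half of the SVD, and `keep="bottom"` to pruning the top half.

```python
import numpy as np


def normalized_effective_rank(weight, eps=1e-12):
    """Entropy-based effective rank scaled into (0, 1] (assumed definition)."""
    s = np.linalg.svd(weight, compute_uv=False)
    p = s / (s.sum() + eps)                 # singular values as a distribution
    entropy = -np.sum(p * np.log(p + eps))  # eps guards against log(0)
    return float(np.exp(entropy) / len(s))


def prune_spectrum(weight, keep="top"):
    """Zero out half of the singular values and rebuild the matrix."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    half = len(s) // 2
    mask = np.zeros_like(s)
    if keep == "top":
        mask[:half] = 1.0   # keep the largest half, prune the bottom half
    elif keep == "bottom":
        mask[half:] = 1.0   # keep the smallest half, prune the top half
    else:
        raise ValueError("keep must be 'top' or 'bottom'")
    return (u * (s * mask)) @ vt


rng = np.random.default_rng(0)
w = rng.normal(size=(6, 4))

w_top = prune_spectrum(w, keep="top")
w_bottom = prune_spectrum(w, keep="bottom")

err_top = np.linalg.norm(w - w_top)        # error after pruning the bottom half
err_bottom = np.linalg.norm(w - w_bottom)  # error after pruning the top half

print("normalized effective rank:", normalized_effective_rank(w))
print("Frobenius error, keep top half:   ", err_top)
print("Frobenius error, keep bottom half:", err_bottom)
```

Keeping the top half never gives a larger Frobenius reconstruction error than keeping the bottom half, since the discarded singular values are the smaller ones. That concerns only the matrix approximation; the effect on training and validation loss is what Figure 4 reports empirically.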
