Decoupling Dynamical Richness from Representation Learning: Towards Practical Measurement

Yoonsoo Nam, Nayara Fonseca, Seok Hyeong Lee, Chris Mingard, Niclas Goring, Ouns El Harzli, Abdurrahman Hadi Erturk, Soufiane Hayou, Ard A. Louis

TL;DR

We propose a computationally efficient, performance-independent metric of richness grounded in the low-rank bias of rich dynamics, which recovers neural collapse as a special case, and introduce an eigendecomposition-based visualization to support interpretability.

Abstract

Dynamic feature transformation (the rich regime) does not always align with predictive performance (better representation), yet accuracy is often used as a proxy for richness, limiting analysis of their relationship. We propose a computationally efficient, performance-independent metric of richness grounded in the low-rank bias of rich dynamics, which recovers neural collapse as a special case. The metric is empirically more stable than existing alternatives and captures known lazy-to-rich transitions (e.g., grokking) without relying on accuracy. We further use it to examine how training factors (e.g., learning rate) relate to richness, confirming recognized assumptions and highlighting new observations (e.g., batch normalization promotes rich dynamics). An eigendecomposition-based visualization is also introduced to support interpretability; together, the metric and the visualization provide a diagnostic tool for studying the relationship between training factors, dynamics, and representations.

Paper Structure (80 sections, 6 theorems, 60 equations, 25 figures, 10 tables, 1 algorithm)


Key Result

Proposition 1

If $\mathcal{T}$ is an MP-operator, then the NC1 condition (collapse of within-class variability) $\Sigma_b^{\dagger} \Sigma_W=0$ holds, where $\Sigma_b$ is the inter-class covariance matrix and $\Sigma_W$ is the intra-class covariance matrix.
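The NC1 condition in Proposition 1 can be checked numerically on a set of last-layer features. The sketch below is illustrative only: the page does not reproduce the paper's definitions of $\Sigma_b$ and $\Sigma_W$, so it assumes the conventions common in the neural collapse literature (class means measured around the global mean for $\Sigma_b$, samples measured around their class mean for $\Sigma_W$). The function names `class_covariances` and `nc1_residual` are hypothetical.

```python
import numpy as np

def class_covariances(features, labels):
    """Return (Sigma_b, Sigma_W) for features of shape (n, d) and integer labels of shape (n,).

    Assumed conventions (not taken from the paper):
      Sigma_b = average over classes of (class_mean - global_mean)(class_mean - global_mean)^T
      Sigma_W = average over samples of (x - class_mean)(x - class_mean)^T
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    d = features.shape[1]
    sigma_b = np.zeros((d, d))
    sigma_w = np.zeros((d, d))
    for c in classes:
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        diff = (class_mean - global_mean)[:, None]
        sigma_b += diff @ diff.T / len(classes)
        centered = class_feats - class_mean
        sigma_w += centered.T @ centered / len(features)
    return sigma_b, sigma_w

def nc1_residual(features, labels):
    """Frobenius norm of pinv(Sigma_b) @ Sigma_W; zero when NC1 holds exactly."""
    sigma_b, sigma_w = class_covariances(features, labels)
    return np.linalg.norm(np.linalg.pinv(sigma_b) @ sigma_w)

# Toy data: three classes in three dimensions, five samples per class.
class_means = np.eye(3)
labels = np.repeat(np.arange(3), 5)
collapsed = class_means[labels]                      # every sample sits on its class mean
rng = np.random.default_rng(0)
noisy = collapsed + 0.1 * rng.standard_normal(collapsed.shape)

print("collapsed residual:", nc1_residual(collapsed, labels))
print("noisy residual:    ", nc1_residual(noisy, labels))
```

With fully collapsed features the within-class covariance vanishes, so the residual is zero; adding within-class noise makes it strictly positive. The residual is a scalar summary chosen here for convenience, not a quantity defined in the paper.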

Figures (25)

  • Figure 1: Rich dynamics $\neq$ better generalization. We trained a 4-layer MLP on label-encoded MNIST. (a): The first 10 pixels are encoded with true labels in training and random labels in testing; both the encoding and the image serve as valid features for training. (b): A full backpropagation model (rich) biases toward the encodings and generalizes poorly, while a last-layer-only-trained model (lazy) relies on the full image and generalizes better. Our low-rank-based metric $\mathcal{D}_{LR}$ (\ref{eq:our_measure}) quantifies the dynamical richness ($\mathcal{D}_{LR} \in [0,1]$, where $0$ is richest) independent of the performance. (c): Complementary visualization method (\ref{eq:Pfk}): (i) Cumulative contribution of last-layer features in expressing the target function: the top 10 features are irrelevant in the rich model. (ii) Contribution to the learned function: the rich model uses only the top 10 features, while the lazy model uses all. (iii) Relative feature norms: the rich model concentrates on the top 10; the lazy model decays more gradually. Test accuracies and $\mathcal{D}_{LR}$ values are shown in parentheses and square brackets, respectively. See \ref{subsec:visualization} for details, and \ref{app:linear} for background and motivation.
  • Figure 2: Learning curves and feature learning metric for ResNet18 on CIFAR-10. Both the error (a) and loss (b) learning curves show a transition to a faster-decaying power law with additional data near $n \approx 10^3$, correlating with the shift in the decay of the richness measure $\mathcal{D}_{LR}$ in (c). This agrees with a theoretical study of phase transitions \cite{rubin2024grokking}, according to which a sufficiently large number of data points is critical for rich dynamics, a promising observation toward better understanding feature learning dynamics. A linear model (Gaussian process) is plotted in (a,b) to highlight the transition to a faster-decaying learning curve.
  • Figure 3: Visualization of VGG16 on CIFAR-100 with and without batch normalization. We visualize the last row of \ref{tab:big}, where batch normalization shifts the model from the lazy to the rich regime. The eigenvalue distribution (iii) highlights this difference: with batch normalization, only 100 features are significant, whereas without it the eigenvalues decay slowly.
  • Figure 4: Visualization of the role of learning rate. We visualize the second row of \ref{tab:big}, where the learning rate is varied (up to training instability) for ResNet18 on CIFAR-100. The second column (ii) shows that the model with the smallest learning rate uses significantly more eigenfunctions (features), indicating lazier dynamics, while the other models use only the minimal 100 eigenfunctions.
  • Figure 5: Correlation among dynamics of feature quality, utilization, and intensity. We show individual metrics (e.g., $Q^*(k) := \Pi^*(k)-\Pi^*(k-1)$) instead of cumulative metrics ($\Pi^*(k)$ and $\hat{\Pi}(k)$) at different epochs for ResNet18 on CIFAR-100, normalized for better presentation. Larger intensity features exhibit higher quality and utilization during training.
  • ...and 20 more figures
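
The label-encoding setup described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' data pipeline: it assumes images are flattened to 784 pixels, that the label is written as a one-hot code into the first 10 pixels, and it uses random arrays in place of real MNIST so the block stays self-contained. The helper name `encode_labels` is hypothetical.

```python
import numpy as np

def encode_labels(images, labels, num_classes=10, randomize=False, rng=None):
    """Overwrite the first `num_classes` pixels of each flattened image with a one-hot code.

    randomize=False writes the true label (training split in the caption);
    randomize=True writes a uniformly random label (test split in the caption).
    The one-hot layout is an assumption made for this sketch.
    """
    images = np.array(images, dtype=float)  # np.array copies, so the input is left untouched
    labels = np.asarray(labels)
    if randomize:
        rng = np.random.default_rng() if rng is None else rng
        codes = rng.integers(0, num_classes, size=len(labels))
    else:
        codes = labels
    images[:, :num_classes] = np.eye(num_classes)[codes]
    return images

# Stand-in data with MNIST-like shapes (random pixels, not real digits).
rng = np.random.default_rng(0)
x = rng.random((6, 784))
y = rng.integers(0, 10, size=6)

x_train = encode_labels(x, y)                            # encoding agrees with the label
x_test = encode_labels(x, y, randomize=True, rng=rng)    # encoding carries no label information

print("decoded training labels:", x_train[:, :10].argmax(axis=1))
print("true labels:            ", y)
```

In the training split the first 10 pixels alone determine the label, so a model can fit the training set from the encoding without using the rest of the image; in the test split the encoding is uninformative, which is why reliance on it hurts generalization in the caption's experiment.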

Theorems & Definitions (11)

  • Definition 1: Minimum Projection (MP) operator
  • Proposition 1
  • Proposition 2
  • Lemma 1
  • proof
  • Corollary 1
  • proof
  • Proposition 2
  • proof
  • Proposition 2
  • ...and 1 more