When Representations Align: Universality in Representation Learning Dynamics

Loek van Rossem, Andrew M. Saxe

TL;DR

The paper asks why deep networks develop similar representations despite architectural variety, and proposes a universal theory of representation learning, based on smooth maps, for the expressive, highly parameterized regime. It derives a two-point interaction model that reduces training dynamics to a self-contained three-dimensional system describing representational distance, output difference, and alignment, and then validates these dynamics across multiple architectures and datasets. The work shows that initialization controls rich versus lazy learning regimes, derives an analytic form for the training loss, and finds that the theory's validity depends on depth: later layers agree better with the effective theory, and depth reshapes the effective learning rates. This perspective clarifies which aspects of representation learning are architecture-agnostic and identifies initial weight scale as a crucial factor for the emergent structure, with implications for understanding and designing scalable learning systems.

Abstract

Deep neural networks come in many sizes and architectures. The choice of architecture, in conjunction with the dataset and learning algorithm, is commonly understood to affect the learned neural representations. Yet, recent results have shown that different architectures learn representations with striking qualitative similarities. Here we derive an effective theory of representation learning under the assumption that the encoding map from input to hidden representation and the decoding map from representation to output are arbitrary smooth functions. This theory schematizes representation learning dynamics in the regime of complex, large architectures, where hidden representations are not strongly constrained by the parametrization. We show through experiments that the effective theory describes aspects of representation learning dynamics across a range of deep networks with different activation functions and architectures, and exhibits phenomena similar to the "rich" and "lazy" regime. While many network behaviors depend quantitatively on architecture, our findings point to certain behaviors that are widely conserved once models are sufficiently flexible.

Paper Structure

This paper contains 57 sections, 51 equations, 18 figures, and 2 tables.

Figures (18)

  • Figure 1: Overview of the effective theory for the two-point interaction. For two datapoints $x_1$ and $x_2$, self-contained dynamics are defined on their representational difference $||dh||$, predicted output difference $||dy||$, and output alignment $w$.
  • Figure 2: Universal learning dynamics among different architectures. The representational distance $||dh||^2$, prediction difference $||dy||^2$, and output alignment $w$ during training on a two-point dataset, among architectures with varying connectivity (top) and nonlinearities (bottom), match the theory after fitting two constants. The architectures used are all variations of the default architecture and are initialized at small weights so as to be in the expressive feature learning regime. Details for all experiments can be found in \ref{sec:experiment_details}.
  • Figure 3: Dynamics of the 3-dimensional system (top) and training loss (bottom) at varying initial weights. The default architecture (20 fully connected layers, 500 units per layer, leaky ReLU) is trained on two datapoints and compared to the effective theory, after fitting two effective learning rates for the 3-dimensional system and one additional effective learning rate for the loss. Plateau-like behavior in the representational distance and loss can be seen at small initializations, but disappears at larger initial weights. At very high initial weights, when the representational distance starts off already large, the approximation breaks down as expected.
  • Figure 4: Learning dynamics of a randomly selected zero digit and one digit in MNIST compared against the theory after fitting two constants. The deep neural network has 4 fully connected layers, 100 units per layer, leaky ReLU activation and is initialized at small weights.
  • Figure 5: Structured learning at small initial weights. The final representational distance of two-point experiments with varying input and target output distances compared against the theory (\ref{eq:final_distance}). Each dot represents a single trial, i.e. one training run. The effective learning rates for each architecture have been fit by averaging over 50 trials at fixed input and target output distances.
  • ...and 13 more figures
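
The two-point experiments described in the captions above can be imitated with a small script. What follows is only a minimal sketch, not the authors' code: it assumes a bias-free fully connected network with 4 hidden layers of 100 leaky ReLU units (the sizes quoted for Figure 4), a squared-error loss, plain gradient descent, and Gaussian initialization scaled by a hypothetical `scale` parameter. The choice of the second hidden layer as "the representation" (`rep_layer=2`), the toy inputs, and all hyperparameter values are likewise assumptions made for illustration. The sketch records $||dh||^2$, $||dy||^2$ and the loss at every step; the output alignment $w$ is left out because its definition is not given in this excerpt.

```python
import numpy as np

def leaky_relu(z, slope=0.01):
    return np.where(z > 0, z, slope * z)

def leaky_relu_grad(z, slope=0.01):
    return np.where(z > 0, 1.0, slope)

def init_params(sizes, scale, rng):
    # Assumption: Gaussian weights with std = scale / sqrt(fan_in), no biases.
    return [rng.normal(0.0, scale / np.sqrt(m), size=(m, n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Return the pre-activations and activations of every layer."""
    acts, pres = [x], []
    for i, W in enumerate(params):
        z = acts[-1] @ W
        pres.append(z)
        is_output = i == len(params) - 1
        acts.append(z if is_output else leaky_relu(z))  # linear output layer
    return pres, acts

def backward(params, pres, acts, y):
    """Gradients of 0.5 * mean squared error w.r.t. each weight matrix."""
    delta = (acts[-1] - y) / y.shape[0]
    grads = [None] * len(params)
    for i in reversed(range(len(params))):
        grads[i] = acts[i].T @ delta
        if i > 0:
            delta = (delta @ params[i].T) * leaky_relu_grad(pres[i - 1])
    return grads

def train_two_points(x, y, hidden=(100, 100, 100, 100), rep_layer=2,
                     scale=1.0, lr=0.01, steps=2000, seed=0):
    """Train on two datapoints; return one row (||dh||^2, ||dy||^2, loss) per step."""
    rng = np.random.default_rng(seed)
    params = init_params([x.shape[1], *hidden, y.shape[1]], scale, rng)
    history = []
    for _ in range(steps):
        pres, acts = forward(params, x)
        h, out = acts[rep_layer], acts[-1]  # acts[k] is the k-th hidden layer
        dh2 = np.sum((h[0] - h[1]) ** 2)      # representational distance
        dy2 = np.sum((out[0] - out[1]) ** 2)  # prediction difference
        loss = 0.5 * np.mean(np.sum((out - y) ** 2, axis=1))
        history.append((dh2, dy2, loss))
        grads = backward(params, pres, acts, y)
        params = [W - lr * g for W, g in zip(params, grads)]
    return np.array(history)

if __name__ == "__main__":
    # Toy two-point dataset (an assumption, not taken from the paper).
    x = np.array([[1.0, 0.0], [0.0, 1.0]])
    y = np.array([[1.0], [-1.0]])
    hist = train_two_points(x, y)
    print("initial (dh2, dy2, loss):", hist[0])
    print("final   (dh2, dy2, loss):", hist[-1])
```

Rerunning with a smaller `scale` is one way to explore the small-initialization regime that the captions associate with plateau-like behavior; more steps may then be needed before the loss moves noticeably.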