The Training Process of Many Deep Networks Explores the Same Low-Dimensional Manifold

Jialin Mao; Itay Griniasty; Han Kheng Teoh; Rahul Ramesh; Rubing Yang; Mark K. Transtrum; James P. Sethna; Pratik Chaudhari

TL;DR

This paper reveals that training trajectories of diverse deep networks evolve on an effectively low-dimensional manifold in the space of predictions. By framing networks as probabilistic predictors and employing information-geometric distances, the authors construct isometric InPCA embeddings that faithfully preserve global geometry, enabling visualization and comparison of thousands of models across architectures and training regimes. They show that, despite architectural and optimization diversity, trajectories cluster along similar manifolds, with top components capturing the majority of the variation; larger networks train faster but follow the same path as smaller ones, and different initializations quickly merge into the same manifold. These findings imply a reduced effective complexity in training dynamics and offer a geometric lens for understanding generalization and optimization in deep learning. The work also introduces practical tools for embedding and comparing high-dimensional probabilistic models, with potential implications for designing training protocols and ensembles.

Abstract

We develop information-geometric techniques to analyze the trajectories of the predictions of deep networks during training. By examining the underlying high-dimensional probabilistic models, we reveal that the training process explores an effectively low-dimensional manifold. Networks with a wide range of architectures, sizes, trained using different optimization methods, regularization techniques, data augmentation techniques, and weight initializations lie on the same manifold in the prediction space. We study the details of this manifold to find that networks with different architectures follow distinguishable trajectories but other factors have a minimal influence; larger networks train along a similar manifold as that of smaller networks, just faster; and networks initialized at very different parts of the prediction space converge to the solution along a similar manifold.
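
The Bhattacharyya distance named in the section list and figure captions can be illustrated with a minimal sketch. It assumes each model is represented as a list of per-sample class-probability vectors and that the distance is averaged over samples; the function name `bhattacharyya_distance` and this averaging convention are illustrative assumptions, not taken from the text above.

```python
import math


def bhattacharyya_distance(p, q):
    """Average per-sample Bhattacharyya distance between two models.

    p and q are lists of class-probability vectors, one vector per sample.
    Averaging over samples is an assumption made for this sketch.
    """
    if len(p) != len(q):
        raise ValueError("both models must predict on the same samples")
    total = 0.0
    for pn, qn in zip(p, q):
        # Bhattacharyya coefficient of the two predictive distributions.
        bc = sum(math.sqrt(a * b) for a, b in zip(pn, qn))
        if bc <= 0.0:
            return math.inf  # disjoint supports on at least one sample
        total += -math.log(min(bc, 1.0))  # clamp guards against rounding
    return total / len(p)


# "Ignorance" (uniform predictions) and "truth" (one-hot labels) on 2 samples.
P0 = [[0.25] * 4, [0.25] * 4]
P_star = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
d = bhattacharyya_distance(P0, P_star)
```

Against one-hot labels the per-sample distance reduces to half the negative log-probability of the correct class, which is consistent with the Figure 3 caption's remark that the train loss is two times the Bhattacharyya distance to the truth.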

Paper Structure (37 sections, 1 theorem, 26 equations, 29 figures, 2 tables)

Measuring distances in the prediction space
Measuring distances between trajectories in the prediction space
Embedding predictions into a lower-dimensional space for visualization
Adding new networks into an existing embedding
Computing averages in the prediction space
Characterizing the details of the train manifold
Embedding probabilistic models along train and test trajectories into the same space
A new insight into optimization in deep learning
Computational Information Geometry
Interpretation of the top three principal coordinates
Why are the train and test manifolds effectively low-dimensional?
Notation
Derivation of the joint probability of predictions and the Bhattacharyya distance
Details of the experimental setup
Datasets
...and 22 more sections

Key Result

Theorem 1

Given a finite symmetric premetric space $\mathcal{M} = (M, D)$ with $|M| = m$ points, if $D \in \mathbb{R}^{m \times m}$ is the matrix of pairwise distances between these points, then the eigen-embedding of $W = -LDL/2$, where $L_{ij} = \delta_{ij} - 1/m$ is the centering matrix, is isometric to
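
The construction in the theorem can be sketched numerically. The example below builds the centering matrix $L$, forms $W = -LDL/2$, and uses the eigenvectors scaled by the square root of the absolute eigenvalues as coordinates. The function names, the ordering by eigenvalue magnitude, and the convention that coordinates with negative eigenvalues enter the reconstructed distance with a minus sign (our reading of the "time-like" coordinates mentioned in the Figure 2 caption) are assumptions of this sketch, not statements from the theorem text.

```python
import numpy as np


def inpca_embedding(D):
    """Eigen-embedding of W = -L D L / 2 for a symmetric premetric matrix D.

    Returns coordinates (one row per point) and the sign of each coordinate's
    eigenvalue. Sorting by |eigenvalue| is a choice made for this sketch.
    """
    D = np.asarray(D, dtype=float)
    m = D.shape[0]
    L = np.eye(m) - np.ones((m, m)) / m   # centering matrix, L_ij = delta_ij - 1/m
    W = -L @ D @ L / 2.0
    W = (W + W.T) / 2.0                   # remove asymmetry from rounding
    eigvals, eigvecs = np.linalg.eigh(W)
    order = np.argsort(-np.abs(eigvals))  # largest magnitude first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    coords = eigvecs * np.sqrt(np.abs(eigvals))
    return coords, np.sign(eigvals)


def pairwise_from_embedding(coords, signs):
    """Signed sum of squared coordinate differences between all pairs."""
    diff = coords[:, None, :] - coords[None, :, :]
    return np.einsum("ijk,k->ij", diff ** 2, signs)


# A symmetric, zero-diagonal premetric that no Euclidean embedding reproduces.
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])
coords, signs = inpca_embedding(D)
D_rebuilt = pairwise_from_embedding(coords, signs)
```

In this toy case one eigenvalue of $W$ is negative, and the signed reconstruction recovers $D$ up to rounding, which is the sense in which such an embedding preserves pairwise distances.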

Figures (29)

Figure 1: A schematic of the procedure in \ref{eq:tw} used to compute progress $s_w$ by projecting a model $P_w$ along a training trajectory onto the geodesic between ignorance $P_0$ and truth $P_*$.
Figure 2: The manifold of models along training trajectories of networks with different configurations (architectures denoted by different colors, optimization algorithms, hyper-parameters, and regularization mechanisms) is effectively low-dimensional for (a) CIFAR-10, and (d) ImageNet. Different configurations train along similar trajectories but are quite different from the geodesic between ignorance $P_0$ and truth $P_*$ (not seen here). The manifold is hyper-ribbon-like \cite{transtrumGeometryNonlinearLeast2011}: eigenvalues of the InPCA distance matrix \ref{eq:w} for CIFAR-10 (b) and ImageNet (e) are spread over a large range with the top few dimensions capturing a large fraction of the stress \ref{eq:explained_stress} (numbers indicate explained stress in the top 1, 3, 10, 25 and 50 dimensions). Time-like coordinates corresponding to negative InPCA eigenvalues are red. (c): a pairwise comparison for the first three principal components; note that PC2 is time-like (same data as (a)). In (a,d), we have drawn smooth curves denoting trajectories by hand to guide the reader.
Figure 3: Comparison of the top two principal components of an InPCA embedding of all models on CIFAR-10 colored by the architectures (a) (same as \ref{fig:all_models_train_2d}), train loss (b), which is two times the Bhattacharyya distance $\text{d}_{\text{B}}(P, P_*)$ for classification tasks like ours, train error in (c), and by whether they are within a Bhattacharyya distance < 0.15 from models marked A, B, and C on the geodesic in (d). These figures are discussed in the narrative and should be studied together with \ref{fig:all_models_train_2d}.
Figure 4: Number of models $P$ with $\text{d}_{\text{B}}(P, P_*) > 2$ (that are away from the main manifold) stratified by (a) architectures and (b) the number of epochs.
Figure 5: The manifold of predictions on the test data of networks with different configurations (architectures denoted by different colors, different optimization algorithms and regularization mechanisms) on CIFAR-10 in (a) and on ImageNet in (d) is also effectively low-dimensional. Trajectories of different architectures are distinctive on the test data. The test manifold is also hyper-ribbon-like: eigenvalues of the InPCA distance matrix \ref{eq:w} for CIFAR-10 (b) and ImageNet (e) are spread over a large range and the top few dimensions capture a large fraction of the stress \ref{eq:explained_stress} (numbers indicate explained stress in the top 1, 3, 10, 25 and 50 dimensions). (c) shows a pairwise comparison for the first three principal components for CIFAR-10 models. PC1-PC2 of \ref{fig:all_models_train_2d} look quite similar to those of (c). In (a,d), we have drawn smooth curves denoting trajectories by hand to guide the reader.
...and 24 more figures
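
The progress measure $s_w$ described in the Figure 1 caption can be illustrated for a single predictive distribution. This is a minimal sketch under stated assumptions: the geodesic is taken to be the great circle between the square-root probability vectors of $P_0$ and $P_*$, and the projection is found by a grid search that minimizes the Bhattacharyya distance. The equation the caption refers to is not reproduced here, so the function names `geodesic_point` and `progress`, the single-sample setting, and the grid search are all illustrative choices rather than the authors' procedure.

```python
import math


def geodesic_point(p0, p1, lam):
    """Point at fraction lam along the great circle joining sqrt(p0), sqrt(p1)."""
    bc = min(1.0, sum(math.sqrt(a * b) for a, b in zip(p0, p1)))
    theta = math.acos(bc)  # angle between the two square-root vectors
    if theta < 1e-12:
        return list(p0)    # endpoints coincide, nothing to interpolate
    w0 = math.sin((1.0 - lam) * theta) / math.sin(theta)
    w1 = math.sin(lam * theta) / math.sin(theta)
    # Interpolate in square-root coordinates, then square to get probabilities.
    return [(w0 * math.sqrt(a) + w1 * math.sqrt(b)) ** 2 for a, b in zip(p0, p1)]


def progress(p, p0, p_star, steps=1000):
    """Grid-search the lam in [0, 1] whose geodesic point is closest to p."""
    best_lam, best_d = 0.0, math.inf
    for k in range(steps + 1):
        lam = k / steps
        g = geodesic_point(p0, p_star, lam)
        bc = sum(math.sqrt(a * b) for a, b in zip(p, g))
        d = -math.log(bc) if bc > 0.0 else math.inf
        if d < best_d:
            best_lam, best_d = lam, d
    return best_lam


p0 = [0.25, 0.25, 0.25, 0.25]    # ignorance: uniform prediction
p_star = [1.0, 0.0, 0.0, 0.0]    # truth: one-hot label
halfway = geodesic_point(p0, p_star, 0.5)
s = progress(halfway, p0, p_star)
```

A model that sits on the geodesic is assigned its own interpolation fraction, while a model off the geodesic is assigned the fraction of its nearest geodesic point, which is what makes the quantity usable as a common "clock" across trajectories.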

Theorems & Definitions (2)

Theorem 1
proof
