The Training Process of Many Deep Networks Explores the Same Low-Dimensional Manifold
Jialin Mao, Itay Griniasty, Han Kheng Teoh, Rahul Ramesh, Rubing Yang, Mark K. Transtrum, James P. Sethna, Pratik Chaudhari
TL;DR
This paper shows that the training trajectories of diverse deep networks evolve on an effectively low-dimensional manifold in the space of predictions. By treating networks as probabilistic models and using information-geometric distances between their predictions, the authors construct isometric InPCA embeddings that preserve the global geometry of those distances, which makes it possible to visualize and compare thousands of models across architectures and training regimes. Despite differences in architecture and optimization, the trajectories lie along similar manifolds, and the top few embedding components capture most of the variation. Larger networks train faster but along a manifold similar to that of smaller ones, and networks initialized at very different points in the prediction space quickly converge onto the same manifold. These findings suggest a reduced effective complexity in training dynamics and offer a geometric lens for understanding generalization and optimization in deep learning. The work also introduces practical tools for embedding and comparing high-dimensional probabilistic models, with potential implications for designing training protocols and ensembles.
Abstract
We develop information-geometric techniques to analyze the trajectories of the predictions of deep networks during training. By examining the underlying high-dimensional probabilistic models, we reveal that the training process explores an effectively low-dimensional manifold. Networks with a wide range of architectures and sizes, trained using different optimization methods, regularization techniques, data augmentation techniques, and weight initializations, lie on the same manifold in the prediction space. We study the details of this manifold and find that networks with different architectures follow distinguishable trajectories, while other factors have minimal influence; that larger networks train along a manifold similar to that of smaller networks, just faster; and that networks initialized at very different parts of the prediction space converge to the solution along a similar manifold.
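
To make the idea of embedding probabilistic models concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each model checkpoint is represented by its predicted class probabilities on a fixed set of samples, that the information-geometric distance is the Bhattacharyya distance averaged over samples (an assumption; the summary above does not name the distance), and that the embedding is obtained by double-centering the pairwise distance matrix and keeping the eigenvectors with the largest eigenvalue magnitudes. The function names (bhattacharyya_distance, pairwise_distances, inpca_embed) are hypothetical and chosen for illustration only.

```python
import numpy as np


def bhattacharyya_distance(p, q, eps=1e-12):
    """Mean per-sample Bhattacharyya distance between two predictors.

    p, q: arrays of shape (num_samples, num_classes); each row is a
    probability distribution over classes for one input sample.
    """
    # Bhattacharyya coefficient per sample, clipped to (0, 1] for stability.
    bc = np.clip(np.sum(np.sqrt(p * q), axis=1), eps, 1.0)
    return float(np.mean(-np.log(bc)))


def pairwise_distances(models):
    """Symmetric matrix of distances between all pairs of models."""
    m = len(models)
    D = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            D[i, j] = D[j, i] = bhattacharyya_distance(models[i], models[j])
    return D


def inpca_embed(D, k=3):
    """Embed points from a pairwise distance matrix D (assumed sketch).

    Double-centers D, eigendecomposes the result, and keeps the k
    eigenpairs with the largest |eigenvalue|. Negative eigenvalues are
    kept: they correspond to "time-like" directions, so distances are
    recovered with a Minkowski-style signature sign(eigenvalue).
    """
    m = D.shape[0]
    L = np.eye(m) - np.ones((m, m)) / m      # centering matrix
    W = -0.5 * L @ D @ L                     # double-centered matrix
    eigvals, eigvecs = np.linalg.eigh(W)     # W is symmetric
    order = np.argsort(-np.abs(eigvals))[:k]
    coords = eigvecs[:, order] * np.sqrt(np.abs(eigvals[order]))
    return coords, eigvals[order]


if __name__ == "__main__":
    # Toy stand-in for checkpoints: random predictions on 10 samples, 4 classes.
    rng = np.random.default_rng(0)
    checkpoints = [rng.dirichlet(np.ones(4), size=10) for _ in range(6)]
    D = pairwise_distances(checkpoints)
    coords, lam = inpca_embed(D, k=3)
    # Fraction of total |eigenvalue| mass captured by each kept component.
    _, all_lam = inpca_embed(D, k=D.shape[0])
    print(np.abs(lam) / np.sum(np.abs(all_lam)))
```

In this sketch, keeping all components reproduces the input distances exactly when coordinate differences are squared and summed with the sign of each eigenvalue, which is the sense in which such an embedding preserves global geometry. The share of eigenvalue magnitude held by the leading components is one way to quantify how low-dimensional a set of trajectories is.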
