How does training shape the Riemannian geometry of neural network representations?

Jacob A. Zavatone-Veth, Sheng Yang, Julian A. Rubinfien, Cengiz Pehlevan

TL;DR

The paper investigates how training reshapes the Riemannian geometry of neural representations by studying the metric induced on input space by neural feature maps. It establishes an infinite-width baseline where shallow networks produce spherically symmetric metrics, and empirically shows that training magnifies volume elements near decision boundaries across shallow, deep, and self-supervised settings. The results suggest that feature learning exploits nonlinear geometry to enhance discriminability near boundaries, offering a framework to understand and design geometric inductive biases. This geometry-centric perspective provides groundwork for principled analysis of generalization, robustness, and kernel-learning methods in neural representations.

Abstract

In machine learning, there is a long history of trying to build neural networks that can learn from fewer example data by baking in strong geometric priors. However, it is not always clear a priori what geometric constraints are appropriate for a given task. Here, we explore the possibility that one can uncover useful geometric inductive biases by studying how training molds the Riemannian geometry induced by unconstrained neural network feature maps. We first show that at infinite width, neural networks with random parameters induce highly symmetric metrics on input space. This symmetry is broken by feature learning: networks trained to perform classification tasks learn to magnify local areas along decision boundaries. This holds in deep networks trained on high-dimensional image classification tasks, and even in self-supervised representation learning. These results begin to elucidate how training shapes the geometry induced by unconstrained neural network feature maps, laying the groundwork for an understanding of this richly nonlinear form of feature learning.
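
To make the "metric induced on input space" concrete, the sketch below computes it for a shallow feature map with random parameters. It assumes the standard pullback construction $g = J^\top J / n$, where $J$ is the Jacobian of the $n$ hidden-layer features with respect to the input, and it returns $\log_{10}(\sqrt{\det g})$, the quantity plotted in the figures. The erf activation, standard Gaussian initialization, $1/n$ normalization, and all function names are illustrative assumptions rather than details confirmed by this summary. The width of 250 and the two-dimensional input merely echo the [2, 250, 2] architecture of Figure 1, and the network here is untrained.

```python
import math
import random


def make_shallow_features(width=250, dim=2, seed=0):
    """Draw random Gaussian weights and biases (assumed initialization)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(width)]
    biases = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return weights, biases


def metric_2d(x, weights, biases):
    """Pullback metric g = J^T J / n for features erf(w_i . x + b_i), 2-D input.

    Returns the three independent entries (g00, g01, g11) of the symmetric
    2x2 matrix.
    """
    n = len(weights)
    g00 = g01 = g11 = 0.0
    for w, b in zip(weights, biases):
        pre = w[0] * x[0] + w[1] * x[1] + b
        # Derivative of erf(z) is 2 / sqrt(pi) * exp(-z^2).
        slope = 2.0 / math.sqrt(math.pi) * math.exp(-pre * pre)
        # Row i of the Jacobian is slope * w_i.
        j0, j1 = slope * w[0], slope * w[1]
        g00 += j0 * j0
        g01 += j0 * j1
        g11 += j1 * j1
    return g00 / n, g01 / n, g11 / n


def log10_volume_element(x, weights, biases):
    """log10 of sqrt(det g) at input point x."""
    g00, g01, g11 = metric_2d(x, weights, biases)
    det = g00 * g11 - g01 * g01
    return 0.5 * math.log10(det)
```

Evaluating `log10_volume_element` on a grid of input points gives a map of the kind shown in Figure 1. Repeating the evaluation with trained weights in place of the random draw would, under the same assumptions, show how training reshapes that map.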

Paper Structure (38 sections, 177 equations, 48 figures)

Figures (48)

  • Figure 1: Evolution of the volume element over training epochs in a network with architecture [2, 250, 2] trained to classify points separated by a sinusoidal boundary $y=\frac{3}{5}\sin(7x - 1)$. Red lines indicate the decision boundaries of the network. See Appendix \ref{app:xor} for experimental details and additional visualizations.
  • Figure 2: Top panel: $\log_{10}(\sqrt{\det g})$ induced at interpolated images between 7 and 6 by a single-hidden-layer fully-connected network trained to classify MNIST digits. Bottom panel: Digit class predictions and $\log_{10}(\sqrt{\det g})$ for the plane spanned by MNIST digits 7, 6, and 1 at the final training epoch (200). Sample images are visualized at the endpoints and midpoint for each set. Each line is colored by its prediction at the interpolated region and endpoints. As training progresses, the volume elements bulge in the middle (near the decision boundary) and taper off toward the endpoints. See Appendix \ref{app:mnist} for experimental details and Figure \ref{fig:more_mnist} for images interpolated between other digits.
  • Figure 3: Top panel: $\log_{10}(\sqrt{\det g})$ induced at interpolated images between a horse and a frog by ResNet-34 with GELU activation trained to classify CIFAR-10 images. Bottom panel: Class predictions for the plane spanned by images of a horse, a frog, and a car. The volume element is largest at the intersection of several binary decision boundaries and smallest within each decision region. The one-dimensional slices along the edges of each ternary plot are consistent with the top panel. See Appendix \ref{app:resnet} for experimental details, and Figure \ref{fig:more_cifar} for linear interpolations and planes spanned by other classes, as well as how the plane evolves during training.
  • Figure 4: Visualization of volume elements across blocks of a ResNet-34 with GELU activations. Top panels: $\log_{10}(\sqrt{\det g})$ with class label predictions at interpolated samples between a car and a dog at the start of training; panels from left to right show the volume element across depth. Bottom panels: same quantities at the end of training (epoch 500). Our observation that volume elements are largest near the decision boundary is consistent across blocks, with the contrast between the volume element at the test points and near the boundary increasing with depth. See Figure \ref{fig:deep_resnet_2d} for similar visualizations along two-dimensional slices through input space, and Appendix \ref{app:resnet} for experimental details.
  • Figure D.1: Volume element (left) and Ricci scalar $R$ (right) for erf networks with three hidden units on the unit circle and bias zero (top) or one (bottom). See text for full description of the setup.
  • ...and 43 more figures