Fisher-Rao Metric, Geometry, and Complexity of Neural Networks

Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, James Stokes

TL;DR

The paper introduces the Fisher-Rao norm as an information-geometric, invariant capacity measure for deep networks and links it to natural gradient and generalization. It provides an analytical identity for the Fisher-Rao norm and shows that this norm serves as an umbrella for existing norm-based capacity measures, establishing norm-comparison inequalities across several geometries. The authors develop generalization bounds for deep linear and rectified networks via Fisher-Rao geometry, and validate the theory with CIFAR-10 experiments showing that the Fisher-Rao norm stays stable under over-parameterization and correlates with the generalization gap. The work offers a unifying geometric perspective on neural network capacity and suggests invariant optimization approaches aligned with the Fisher-Rao geometry.

Abstract

We study the relationship between geometry and capacity measures for deep neural networks from an invariance viewpoint. We introduce a new notion of capacity, the Fisher-Rao norm, that possesses desirable invariance properties and is motivated by Information Geometry. We discover an analytical characterization of the new capacity measure, through which we establish norm-comparison inequalities and further show that the new measure serves as an umbrella for several existing norm-based complexity measures. We discuss upper bounds on the generalization error induced by the proposed measure. Extensive numerical experiments on CIFAR-10 support our theoretical findings. Our theoretical analysis rests on a key structural lemma about partial derivatives of multi-layer rectifier networks.

Paper Structure

This paper contains 15 sections, 11 theorems, 60 equations, 7 figures, 1 table.

Key Result

Lemma 2.1

Given a single data input $x \in \mathbb{R}^p$, consider the feedforward neural network in Definition 1 with activations satisfying $\sigma(z) = \sigma'(z) z$. Then for any $0\leq t \leq s \leq L$, one has the identity $\sum_{i \in [k_t], j \in [k_{t+1}]} \frac{\partial O^{s+1}}{\partial W^{t
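The lemma statement is cut off in the source, so the following is only a sketch of a property that holds in the lemma's setting, not a reproduction of its right-hand side. For a bias-free network whose activations satisfy $\sigma(z) = \sigma'(z) z$ (ReLU and the identity both do), the output is positively homogeneous of degree 1 in each layer's weight matrix, so by Euler's theorem the contraction $\sum_{i,j} \frac{\partial O}{\partial W^t_{ij}} W^t_{ij}$ equals the output itself. The check below uses only the Python standard library. The function names, layer sizes, seed, the linear last layer, and the convention that the ReLU derivative at 0 is 0 are all illustrative assumptions, not taken from the paper.

```python
import random

def relu(v):
    return [max(0.0, z) for z in v]

def matvec(W, x):
    # W is stored as (out, in): each row holds the weights into one output unit.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def forward(weights, x):
    """Bias-free network: ReLU on hidden layers, linear last layer (assumption)."""
    acts = [x]   # acts[t] is the input fed to layer t
    pres = []    # pres[t] is the pre-activation produced by layer t
    last = len(weights) - 1
    for t, W in enumerate(weights):
        z = matvec(W, acts[-1])
        pres.append(z)
        acts.append(z if t == last else relu(z))
    return acts, pres

def layer_gradients(weights, x):
    """Manual backprop for a scalar output; returns (output, [dO/dW^t for each t])."""
    acts, pres = forward(weights, x)
    out = acts[-1][0]
    delta = [1.0]  # dO/d(pre-activation) at the linear output layer
    grads = [None] * len(weights)
    for t in reversed(range(len(weights))):
        # dO/dW^t[j][i] = delta[j] * acts[t][i]
        grads[t] = [[d * a for a in acts[t]] for d in delta]
        if t > 0:
            # Propagate through W^t, then through the ReLU of layer t-1.
            back = [sum(weights[t][j][i] * delta[j] for j in range(len(delta)))
                    for i in range(len(acts[t]))]
            delta = [b * (1.0 if z > 0 else 0.0) for b, z in zip(back, pres[t - 1])]
    return out, grads

def contraction(W, G):
    """Sum over all entries of dO/dW_ij * W_ij."""
    return sum(w * g for row_w, row_g in zip(W, G) for w, g in zip(row_w, row_g))

# Hypothetical small network: input dim 4, hidden widths 5 and 3, scalar output.
rng = random.Random(0)
sizes = [4, 5, 3, 1]
weights = [[[rng.gauss(0.0, 1.0) for _ in range(sizes[t])]
            for _ in range(sizes[t + 1])]
           for t in range(len(sizes) - 1)]
x = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]

out, grads = layer_gradients(weights, x)
for t, (W, G) in enumerate(zip(weights, grads)):
    print(f"layer {t}: <dO/dW, W> = {contraction(W, G):.6f}, output = {out:.6f}")
```

Each printed contraction should match the network output up to floating-point error, for every layer. Adding bias terms or an activation such as tanh breaks the homogeneity and, with it, the equality.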

Figures (7)

  • Figure 1: Dependence of different norms on width $k$ of hidden layers $(L=2)$ after optimizing with vanilla gradient descent (red) and natural gradient descent (blue).
  • Figure 2: Dependence of capacity measures on label randomization after optimizing with gradient descent. The colors show the effect of varying network width from $k=200$ (red) to $k=1000$ (blue) in increments of 100.
  • Figure 3: Dependence of different norms on depth $L$ ($k = 500$) after optimizing with vanilla gradient descent (red) and natural gradient descent (blue). The Fisher-Rao norms are normalized by $L+1$.
  • Figure 4: Dependence of capacity measures on label randomization after optimizing with natural gradient descent. The colors show the effect of varying network width from $k=200$ (red) to $k=1000$ (blue) in increments of 100. The natural gradient optimization clearly distinguishes the network architectures according to their Fisher-Rao norm.
  • Figure 5: Distribution of margins found by natural gradient (top) and vanilla gradient (bottom) before rescaling (left) and after rescaling by spectral norm (center) and empirical Fisher-Rao norm (right).
  • ...and 2 more figures

Theorems & Definitions (39)

  • Definition 1
  • Lemma 2.1: Structure in Gradient
  • Corollary 2.1: Large Margin Stationary Points
  • Corollary 2.2: Stationary Points for Deep Linear Networks
  • Remark 2.1
  • Remark 2.2
  • Definition 2
  • Theorem 3.1: Fisher-Rao norm
  • Remark 3.1
  • Corollary 3.1: Invariance
  • ...and 29 more