Loss Barcode: A Topological Measure of Escapability in Loss Landscapes

Serguei Barannikov, Daria Voronkova, Alexander Mironenko, Ilya Trofimov, Alexander Korotin, Grigorii Sotnikov, Evgeny Burnaev

TL;DR

This paper uses the loss function topology to relate the local behavior of gradient descent trajectories with the global properties of the loss surface, and defines the neural network's Topological Obstructions score with the help of robust topological invariants, barcodes of the loss function, which quantify the escapability of local minima for gradient-based optimization.

Abstract

Neural network training is commonly based on SGD. However, the understanding of SGD's ability to converge to good local minima, given the non-convex nature of loss functions and the intricate geometric characteristics of loss landscapes, remains limited. In this paper, we apply topological data analysis methods to loss landscapes to gain insights into the learning process and generalization properties of deep neural networks. We use the loss function topology to relate the local behavior of gradient descent trajectories with the global properties of the loss surface. For this purpose, we define the neural network's Topological Obstructions score ("TO-score") with the help of robust topological invariants, barcodes of the loss function, which quantify the escapability of local minima for gradient-based optimization. Our two principal observations are: 1) the loss barcode of the neural network decreases with increasing depth and width, so the topological obstructions to learning diminish; 2) in certain situations there is a connection between the length of minima segments in the loss barcode and the minima's generalization errors. Our statements are based on extensive experiments with fully connected, convolutional, and transformer architectures and several datasets, including MNIST, FMNIST, CIFAR10, CIFAR100, SVHN, and the multilingual OSCAR text dataset.
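
To make the notion of "minima segments" concrete, here is a minimal sketch, assuming a one-dimensional loss sampled on a grid. It computes the standard 0-dimensional sublevel-set persistence barcode with a union-find structure: each local minimum gets a segment that starts at its loss value and ends at the lowest barrier that must be crossed to reach a lower minimum. The function name `sublevel_barcode_1d` and the toy values are illustrative assumptions; this is not the algorithm from the paper, which works with high-dimensional loss surfaces.

```python
import math


def sublevel_barcode_1d(values):
    """0-dimensional sublevel-set persistence of a function sampled on a 1D grid.

    Returns sorted (birth, death) pairs. The global minimum gets death = inf.
    Illustrative toy only, not the procedure used in the paper.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    parent = [None] * n  # None marks grid points not yet in the sublevel set
    birth = {}           # component root -> loss value of its lowest point

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    bars = []
    for i in order:  # sweep the threshold upward through the sampled values
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component with the higher minimum dies.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if birth[young] < values[i]:  # skip zero-length segments
                    bars.append((birth[young], values[i]))
                parent[young] = old
    bars.append((min(values), math.inf))
    return sorted(bars)


if __name__ == "__main__":
    toy_loss = [3, 1, 4, 0, 5, 2, 6]  # local minima at values 1, 0 and 2
    print(sublevel_barcode_1d(toy_loss))
    # [(0, inf), (1, 4), (2, 5)]
```

In this toy example the minimum at value 1 has the segment (1, 4): a descent path must climb to level 4 before it can reach the lower minimum at 0. Longer segments correspond to minima that are harder to escape in this sense.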

Paper Structure

This paper contains 29 sections, 2 theorems, 16 equations, 20 figures, 8 tables, 3 algorithms.

Key Result

Theorem 1

Let $L$ be a piecewise smooth continuous function on a domain $D\subset \mathbb{R}^n$, $n\geq 5$, with $-\nabla(L)\vert_{\partial D}$ pointing outside the domain $D$, and such that, for all $r\geq 0$, index $r$ TO-score$(L)=0$. Then there exists an arbitrarily small smooth perturbation of $L$ which,

Figures (20)

  • Figure 1: (a): The two local minima, indicated by circles, look the same locally but pose different difficulty to gradient-based optimization. The difficulty is quantified by the lengths of the green segments $s_p$, attached to these minima in $\text{Barcode}\,(L)$. $\text{Barcode}\,(L)$ for the simple loss landscape on the right is shown in subfigure (b).
  • Figure 2: Barcodes of fully connected deep neural networks (consisting of 2, 3, 4, 6, 8 layers) trained on MNIST and FMNIST datasets.
  • Figure 3: Decrease of the TO-score as the number of layers in FC networks grows.
  • Figure 4: Barcodes of Convolutional Neural Networks on the CIFAR10 dataset. Networks are sorted by increasing number of parameters.
  • Figure 5: Effect of Batch Normalization on the barcodes' height for convolutional neural networks.
  • ...and 15 more figures

Theorems & Definitions (12)

  • Definition 1
  • Definition 2
  • Remark 1
  • Definition 3
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Remark 2
  • Remark 3
  • ...and 2 more