Newton Losses: Using Curvature Information for Learning with Differentiable Algorithms

Felix Petersen, Christian Borgelt, Tobias Sutter, Hilde Kuehne, Oliver Deussen, Stefano Ermon

TL;DR

We present Newton Losses, a method that improves the performance of existing hard-to-optimize losses by exploiting their second-order information via their empirical Fisher and Hessian matrices, while the network itself is still trained with gradient descent.

Abstract

When training neural networks with custom objectives, such as ranking losses and shortest-path losses, a common problem is that they are, per se, non-differentiable. A popular approach is to continuously relax the objectives to provide gradients, enabling learning. However, such differentiable relaxations are often non-convex and can exhibit vanishing and exploding gradients, making them (already in isolation) hard to optimize. In this setting, the loss function becomes the bottleneck when training a deep neural network. We present Newton Losses, a method for improving the performance of existing hard-to-optimize losses by exploiting their second-order information via their empirical Fisher and Hessian matrices. Instead of training the neural network with second-order techniques, we only utilize the loss function's second-order information to replace it with a Newton Loss, while training the network with gradient descent. This makes our method computationally efficient. We apply Newton Losses to eight differentiable algorithms for sorting and shortest paths, achieving significant improvements for less-optimized differentiable algorithms and consistent improvements even for well-optimized differentiable algorithms.
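The idea of replacing a loss with a Newton Loss can be illustrated with a minimal sketch. It assumes the Hessian variant takes the form $z^\star = z - (H + \lambda I)^{-1} \nabla \ell(z)$ followed by the quadratic loss $\tfrac{1}{2}\|z - z^\star\|^2$, with $z^\star$ treated as a constant and $\lambda$ the Tikhonov regularization strength. The helper names (`solve`, `newton_target`, `newton_loss`) and the toy quadratic loss are hypothetical and are not taken from the paper's code.

```python
def solve(a, b):
    """Solve the small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def newton_target(z, grad, hess, lam):
    """Assumed Hessian variant: z* = z - (H + lam * I)^{-1} grad."""
    n = len(z)
    reg = [[hess[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    step = solve(reg, grad)
    return [zi - si for zi, si in zip(z, step)]


def newton_loss(z, z_star):
    """Quadratic surrogate 0.5 * ||z - z*||^2; its gradient w.r.t. z is z - z*."""
    return 0.5 * sum((zi - ti) ** 2 for zi, ti in zip(z, z_star))


# Hypothetical toy loss l(z) = 0.5 z^T A z - b^T z (gradient A z - b, Hessian A).
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
z = [0.0, 0.0]  # stands in for the network output
grad = [sum(A[i][j] * z[j] for j in range(2)) - b[i] for i in range(2)]

# With lam = 0 and a quadratic loss, z* is the exact minimizer A^{-1} b = [1/11, 7/11].
z_star = newton_target(z, grad, A, lam=0.0)
surrogate = newton_loss(z, z_star)
```

In a training loop, `z` would be the network output and the surrogate would be backpropagated through the network with ordinary gradient descent, so the second-order computation only involves the (typically small) output dimension.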

Paper Structure

This paper contains 38 sections, 3 theorems, 24 equations, 8 figures, 8 tables, 2 algorithms.

Key Result

Lemma 1

Given a distribution over $\mathbb{R}^m$ with a probability density function $\mu$ of the form $\mu(\epsilon)=\exp(-\nu(\epsilon))$ for any twice-differentiable $\nu$, then

Figures (8)

  • Figure 1: Overview of ranking supervision with a differentiable sorting / ranking algorithm. A set of input images is processed element-wise by a CNN, producing a scalar for each image. The scalars are sorted / ranked by the differentiable ranking algorithm, which returns a differentiable permutation matrix that is compared to the ground truth permutation matrix.
  • Figure 2: $12\times12$ Warcraft shortest-path problem. An input terrain map (left), unsupervised ground truth cost embedding (center) and ground truth supervised shortest path (right).
  • Figure 2: Shortest-path benchmark results for different variants of the AlgoVision-relaxed Bellman-Ford algorithm [petersen2021learning]. The metric is the percentage of perfect matches averaged over $10$ seeds. Significant improvements are bold black, and improved means are bold grey.
  • Figure 3: Test accuracy (perfect matches) plot for 'SS of loss' with $10$ samples on the Warcraft shortest-path benchmark. Lines show the mean and shaded areas show the 95% confidence intervals.
  • Figure 4: Ablation study with respect to the Tikhonov regularization strength hyperparameter $\lambda$. Displayed is the element-wise ranking accuracy (individual element ranks correctly identified), averaged over $10$ seeds, with each individual seed additionally shown at low opacity in the background. Left: NeuralSort. Right: SoftSort. Each for $n=5$. Newton Losses, in both the Hessian and the Fisher variants, significantly improve over the baseline across up to (or beyond) 6 orders of magnitude of variation in the hyperparameter $\lambda$. Note the logarithmic horizontal axis.
  • ...and 3 more figures

Theorems & Definitions (7)

  • Definition 1: Newton Losses (Hessian)
  • Definition 2: Newton Loss (Fisher)
  • Lemma 1: Exponential Family Smoothing, adapted from Lemma 1.5 in Abernethy et al. [abernethy2016perturbation]
  • Lemma 2: Gradient Descent Step Equality between \ref{eq:training:update} and \ref{eq:training:update-z}+\ref{eq:training:update-theta}
  • proof
  • Lemma 3: Newton Step Equality between \ref{eq:training:update} and \ref{eq:training:update-z}+\ref{eq:training:update-theta} for $m=1$
  • proof
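
The Fisher variant named in Definition 2 can be sketched in the same spirit. The block below is an illustration under stated assumptions rather than the paper's implementation: it assumes scalar outputs ($m=1$, chosen only to keep the arithmetic visible), an empirical Fisher $F = \frac{1}{N}\sum_i g_i^2$ built from per-example loss gradients $g_i$, and targets $z_i^\star = z_i - g_i / (F + \lambda)$. The function name `fisher_newton_targets` and the batch values are hypothetical.

```python
def fisher_newton_targets(z_batch, grad_batch, lam):
    """Assumed Fisher variant for scalar outputs: z*_i = z_i - g_i / (F + lam)."""
    # Empirical Fisher for m = 1: mean of squared per-example gradients.
    fisher = sum(g * g for g in grad_batch) / len(grad_batch)
    return [z - g / (fisher + lam) for z, g in zip(z_batch, grad_batch)]


# Hypothetical batch of network outputs and loss gradients w.r.t. those outputs.
z_batch = [0.5, -1.0, 2.0]
grad_batch = [1.0, -2.0, 3.0]

# Moderate regularization: F = 14/3, so F + lam = 5 and each step is g_i / 5.
targets = fisher_newton_targets(z_batch, grad_batch, lam=1.0 / 3.0)

# Very strong regularization: steps shrink towards g_i / lam, i.e. a
# (tiny) plain gradient step, so lam interpolates between the two regimes.
damped = fisher_newton_targets(z_batch, grad_batch, lam=1e6)
```

Only first-order gradients of the loss are needed here, which is why a Fisher-based variant is attractive when the Hessian of a differentiable algorithm is expensive or unavailable.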