Corridor Geometry in Gradient-Based Optimization

Benoit Dherin, Mihaela Rosca

TL;DR

This work introduces corridors: geometric regions of a loss landscape where gradient descent (GD) exactly follows the gradient flow (GF). Writing $E$ for the loss, $g$ for its gradient and $H$ for its Hessian, corridors are characterized analytically by $H(\theta) g(\theta) = 0$, and the loss decreases linearly along GF trajectories inside them. The paper proves that corridors are precisely the regions where GD and GF trajectories coincide, so that within these zones GD's descent is linear and free from the implicit regularization effects caused by drift between GD and GF. Leveraging this geometry, the authors propose the Corridor Learning Rate (CLR), an adaptive step-size $h(\theta) = \frac{E(\theta)}{\|g(\theta)\|^2}$, which coincides with the Polyak step-size when $E(\theta^*)=0$ and shows fast, stable convergence in deep learning experiments. The results provide a geometric lens on optimization dynamics, linking theoretical corridor properties to practical learning-rate strategies and suggesting directions for understanding when corridors arise in neural network training and how they relate to implicit regularization.
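To make the step-size formula concrete, here is a minimal sketch of gradient descent with $h(\theta) = E(\theta)/\|g(\theta)\|^2$ on a toy quadratic whose minimum value is zero. The function names (`clr_step_size`, `gd_with_clr`), the zero-gradient guard, and the quadratic itself are illustrative assumptions, not the authors' implementation or experimental setup.

```python
def clr_step_size(loss_value, grad):
    """Step size h = E / ||g||^2 from a loss value and a gradient vector."""
    sq_norm = sum(g_i * g_i for g_i in grad)
    if sq_norm == 0.0:
        # Guard added for this sketch only: take no step at a critical point.
        return 0.0
    return loss_value / sq_norm


def gd_with_clr(loss_fn, grad_fn, theta, n_steps):
    """Run gradient descent with the step size above; return all iterates."""
    iterates = [list(theta)]
    for _ in range(n_steps):
        g = grad_fn(theta)
        h = clr_step_size(loss_fn(theta), g)
        theta = [t - h * g_i for t, g_i in zip(theta, g)]
        iterates.append(theta)
    return iterates


# Hypothetical toy loss: a convex quadratic with minimum value 0 at the origin.
def quad_loss(theta):
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)


def quad_grad(theta):
    return [theta[0], 10.0 * theta[1]]


iterates = gd_with_clr(quad_loss, quad_grad, [1.0, 1.0], 50)
losses = [quad_loss(t) for t in iterates]
print(losses[0], losses[-1])
```

No learning rate is tuned by hand here: the step is computed from the current loss and gradient alone. Because the toy loss is convex with minimum value zero, the squared distance to the minimizer cannot increase from one step to the next, although the loss itself need not decrease monotonically.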

Abstract

We characterize regions of a loss surface as corridors when the continuous curves of steepest descent, that is, the solutions of the gradient flow, become straight lines. We show that corridors provide insights into gradient-based optimization, since corridors are exactly the regions where gradient descent and the gradient flow follow the same trajectory, while the loss decreases linearly. As a result, inside corridors there are none of the implicit regularization effects or training instabilities that have been shown to occur due to the drift between gradient descent and the gradient flow. Using the linear decrease of the loss on corridors, we devise a learning rate adaptation scheme for gradient descent; we call this scheme Corridor Learning Rate (CLR). The CLR formulation coincides with a special case of the Polyak step-size, discovered in the context of convex optimization. The Polyak step-size has recently been shown to also have good convergence properties for neural networks; we further confirm this here with results on CIFAR-10 and ImageNet.
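The claim that gradient descent and the gradient flow coincide on a corridor can be checked numerically on a small example. The sketch below uses the cone $E(\theta) = \|\theta\|$ in two dimensions, chosen here purely for illustration (it is not taken from the paper): its gradient is the unit vector $\theta/\|\theta\|$, so $Hg = 0$ away from the origin. The gradient flow is approximated by taking very small Euler steps, which is an assumption of this sketch rather than an exact solver.

```python
import math


def loss(theta):
    # Cone loss E(theta) = ||theta||, an illustrative corridor away from the origin.
    return math.hypot(theta[0], theta[1])


def grad(theta):
    n = math.hypot(theta[0], theta[1])
    return (theta[0] / n, theta[1] / n)


def hessian_times_grad(theta, eps=1e-6):
    # Finite-difference estimate of H g: how the gradient changes along itself.
    g = grad(theta)
    shifted = (theta[0] + eps * g[0], theta[1] + eps * g[1])
    g_shift = grad(shifted)
    return ((g_shift[0] - g[0]) / eps, (g_shift[1] - g[1]) / eps)


def descend(theta, step, n_steps):
    # Plain gradient descent; with a tiny step it approximates the gradient flow.
    losses = [loss(theta)]
    for _ in range(n_steps):
        g = grad(theta)
        theta = (theta[0] - step * g[0], theta[1] - step * g[1])
        losses.append(loss(theta))
    return theta, losses


theta0 = (3.0, 4.0)
hg = hessian_times_grad(theta0)

# Three large GD steps versus 1500 tiny steps covering the same total time 1.5.
gd_theta, gd_losses = descend(theta0, 0.5, 3)
gf_theta, _ = descend(theta0, 0.001, 1500)

print(hg)
print(gd_theta, gf_theta)
print(gd_losses)
```

Both runs end at essentially the same point, and the loss recorded along the coarse run drops by the same amount at every step, which is the linear decrease described above. On a loss with $Hg \neq 0$ the two runs would generally drift apart as the step size grows.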

Paper Structure

This paper contains 15 sections, 6 theorems, 18 equations, and 9 figures.

Key Result

Lemma 2.8

Suppose that $\theta(t)$ is a solution of $\dot \theta = - g(\theta)$. Then $-Hg$ measures the rate of change of the loss gradient under the GF:

$$\frac{d}{dt}\, g(\theta(t)) = -H(\theta(t))\, g(\theta(t)).$$
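
A short derivation sketch, assuming $E$ is twice differentiable with gradient $g$ and Hessian $H$ (the symbol $E$ for the loss follows the TL;DR; the steps below are a reconstruction, not the paper's proof). The chain rule applied along the flow gives the lemma, and differentiating the loss twice shows why $Hg = 0$ yields linear decrease:

$$
\begin{aligned}
\frac{d}{dt}\, g(\theta(t)) &= H(\theta(t))\,\dot\theta(t) = -H(\theta(t))\, g(\theta(t)), \\
\frac{d}{dt}\, E(\theta(t)) &= g(\theta(t))^\top \dot\theta(t) = -\|g(\theta(t))\|^2, \\
\frac{d^2}{dt^2}\, E(\theta(t)) &= -2\, g(\theta(t))^\top \frac{d}{dt}\, g(\theta(t)) = 2\, g(\theta(t))^\top H(\theta(t))\, g(\theta(t)).
\end{aligned}
$$

Where $H g = 0$, the first line says the gradient is constant along the flow, so the trajectory is a straight line, and the third line says the loss has zero second derivative in time, so it decreases linearly at rate $\|g\|^2$. Starting from loss $E(\theta)$, such a linear decrease would reach zero after time $E(\theta)/\|g(\theta)\|^2$, which matches the CLR step-size.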

Figures (9)

  • Figure 1: Visualizing corridors. One panel shows ruled surfaces forming a corridor, with example lines of steepest descent; the other shows a line of steepest descent (in red) on a loss surface constructed from multiple corridors.
  • Figure 2: The adaptive CLR converges for a ResNet-18 trained on CIFAR-10, and it does so more quickly than SGD with a fixed learning rate obtained from a sweep. Batch size 4096. Results across a wide range of batch sizes are reported in a separate batch-size sweep figure. Results obtained using 3 seeds.
  • Figure 3: The CLR converges for a ResNet-50 trained on ImageNet across batch sizes. A separate figure compares against vanilla SGD: the CLR converges more quickly than the fixed learning rates. Results obtained using 3 seeds.
  • Figure 4: CLR values adapt to training. Note: for ImageNet we cap the maximum value of the learning rate to 10. Results obtained using 3 seeds.
  • Figure 5: The corridor learning rate converges for a ResNet-50 trained on CIFAR-10.
  • ...and 4 more figures

Theorems & Definitions (21)

  • Definition 2.1
  • Example 2.2
  • Example 2.3
  • Example 2.4
  • Example 2.5
  • Example 2.6
  • Example 2.7
  • Lemma 2.8
  • Lemma 2.9
  • proof
  • ...and 11 more