Spectral Preconditioning for Gradient Methods on Graded Non-convex Functions

Nikita Doikov, Sebastian U. Stich, Martin Jaggi

TL;DR

This work proposes gradient methods with spectral preconditioning, which use inexact top eigenvectors of the Hessian to counter the ill-conditioning of the problem, depending on the grade. It also introduces graded non-convexity, a concept that partitions the class of non-convex problems into a nested chain of subclasses.

Abstract

The performance of optimization methods is often tied to the spectrum of the objective Hessian. Yet conventional assumptions, such as smoothness, often do not allow fine-grained convergence statements, particularly for non-convex problems. Striving for a more intricate characterization of complexity, we introduce a unique concept termed graded non-convexity. This allows us to partition the class of non-convex problems into a nested chain of subclasses. Interestingly, many traditional non-convex objectives, including partially convex problems, matrix factorizations, and neural networks, fall within these subclasses. As a second contribution, we propose gradient methods with spectral preconditioning, which employ inexact top eigenvectors of the Hessian to address the ill-conditioning of the problem, contingent on the grade. Our analysis reveals that these new methods provide provably superior convergence rates compared to basic gradient descent on applicable problem classes, particularly when large gaps exist between the top eigenvalues of the Hessian. Our theory is validated by numerical experiments on multiple practical machine learning problems.
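To make the idea of spectral preconditioning concrete, the following is a minimal sketch on a convex quadratic whose spectrum has a large gap after the top two eigenvalues. It is not the paper's algorithm: the function names `spectral_preconditioned_step` and `run_demo`, the test spectrum, and the specific preconditioner (exact curvature on the top-`tau` eigenspace, the next eigenvalue as a uniform scale elsewhere) are all assumptions chosen for illustration. It also uses an exact eigendecomposition for simplicity, whereas the abstract refers to inexact top eigenvectors.

```python
import numpy as np


def spectral_preconditioned_step(x, grad, hess, tau):
    """One gradient step preconditioned by the top-`tau` Hessian eigenpairs.

    Illustrative assumption: the top `tau + 1` eigenvalues are positive and
    tau < len(x). The top eigenspace is rescaled by its own eigenvalues; the
    orthogonal complement is rescaled by the (tau + 1)-th eigenvalue.
    """
    eigvals, eigvecs = np.linalg.eigh(hess)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    lam, V = eigvals[order], eigvecs[:, order]
    top_lam, top_V = lam[:tau], V[:, :tau]
    rest_scale = lam[tau]                     # largest eigenvalue outside the top block

    coeffs = top_V.T @ grad                   # gradient coordinates in the top eigenspace
    step_top = top_V @ (coeffs / top_lam)
    step_rest = (grad - top_V @ coeffs) / rest_scale
    return x - (step_top + step_rest)


def run_demo(num_steps=100, tau=2):
    """Compare plain gradient descent with the preconditioned step on a quadratic."""
    rng = np.random.default_rng(0)
    spectrum = np.array([1000.0, 500.0, 1.0, 0.5, 0.1])  # hypothetical spectrum with a gap
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))     # random orthogonal basis
    A = Q @ np.diag(spectrum) @ Q.T
    A = 0.5 * (A + A.T)                                  # symmetrize against round-off

    def f(x):
        return 0.5 * x @ A @ x

    x_gd = x_pc = rng.standard_normal(5)
    for _ in range(num_steps):
        x_gd = x_gd - (A @ x_gd) / spectrum.max()        # step size 1/L
        x_pc = spectral_preconditioned_step(x_pc, A @ x_pc, A, tau)
    return f(x_gd), f(x_pc)


if __name__ == "__main__":
    f_gd, f_pc = run_demo()
    print(f"plain GD: {f_gd:.3e}, preconditioned: {f_pc:.3e}")
```

In this toy setting, plain gradient descent is limited by the ratio of the largest to the smallest eigenvalue, while the preconditioned step only sees the ratio of the third eigenvalue to the smallest. This mirrors the abstract's remark that the gains are largest when there are big gaps between the top eigenvalues.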

Paper Structure (33 sections, 15 theorems, 113 equations, 6 figures, 1 table)


Key Result

Proposition 2.2

Let $f \in \mathcal{F}_i$ and $g \in \mathcal{F}_j$, for some $0 \leq i, j \leq n$ such that $i + j \geq n$. Then, it holds: where ${\rm smax \,}(f, g)(\boldsymbol{x}) \stackrel{\mathrm{def}}{=} \ln( e^{f(\boldsymbol{x})} + e^{g(\boldsymbol{x})} )$ is the soft maximum of two functions.
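As a small numerical illustration of the soft maximum defined in the proposition, the sketch below evaluates $\ln(e^{f(\boldsymbol{x})} + e^{g(\boldsymbol{x})})$ with NumPy's `logaddexp`, which avoids overflow for large function values. The helper name `smax` follows the proposition's notation; the two functions `f` and `g` are hypothetical examples and are not taken from the paper.

```python
import numpy as np


def smax(f, g):
    """Return the soft maximum x -> ln(exp(f(x)) + exp(g(x))) of two functions."""
    def h(x):
        # logaddexp computes ln(e^a + e^b) without forming the exponentials
        return np.logaddexp(f(x), g(x))
    return h


# Hypothetical example functions, chosen only for illustration.
def f(x):
    return float(np.sum(x ** 2))


def g(x):
    return float(np.sum(np.abs(x)))


if __name__ == "__main__":
    x = np.array([0.5, -2.0])
    print(smax(f, g)(x), max(f(x), g(x)))
```

The soft maximum always lies between the hard maximum and the hard maximum plus $\ln 2$, so it can be read as a smooth upper approximation of $\max(f, g)$.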

Figures (6)

  • Figure 1: Spectrum of the Hessians for the matrix factorization problem (Example 2.10) with optimization dimension $n = 3000$, for $10$ random objectives.
  • Figure 2: Left: spectrum of the Hessians for a two-layer fully connected neural network trained on the MNIST dataset. Right: zoomed top eigenvalues. The total number of parameters is $n = 12730$ (the first layer: $12560$, the second layer: $170$). We see that the dimension $\tau$ of the subspace with positive eigenvalues is much larger than the dimension of the last layer. However, only a few eigenvalues are significantly larger than the others.
  • Figure 3: A surface of a non-convex function $f$. At point $\boldsymbol{x}$ there is a subspace $V_{\boldsymbol{x}}$ where $f$ is convex.
  • Figure 4: The graph of the two-dimensional function $f(x, y) = \sin ( x + y ) + q(x, y)$, where $q$ is a convex quadratic. The non-convex component has the structure of a low-rank vector field (Example 2.7) with $\boldsymbol{u} \equiv (1, 1)^{\top}$.
  • Figure 5: Matrix factorization, $n = 400$. Left: convergence of the methods with spectral preconditioning for $\tau = 20, 80$. Right: spectrum of the Hessian at the initial point and after the last step.
  • ...and 1 more figure
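
For the function in Figure 4, a short derivation shows why the non-convexity is confined to a single direction. The notation $\boldsymbol{z} = (x, y)^{\top}$ is introduced here for convenience and is an assumption, not the paper's notation.

```latex
\begin{aligned}
f(\boldsymbol{z}) &= \sin(\boldsymbol{u}^{\top}\boldsymbol{z}) + q(\boldsymbol{z}),
  \qquad \boldsymbol{z} = (x, y)^{\top}, \quad \boldsymbol{u} = (1, 1)^{\top}, \\
\nabla f(\boldsymbol{z}) &= \cos(\boldsymbol{u}^{\top}\boldsymbol{z})\,\boldsymbol{u} + \nabla q(\boldsymbol{z}), \\
\nabla^2 f(\boldsymbol{z}) &= -\sin(\boldsymbol{u}^{\top}\boldsymbol{z})\,\boldsymbol{u}\boldsymbol{u}^{\top} + \nabla^2 q(\boldsymbol{z}),
  \qquad \boldsymbol{u}\boldsymbol{u}^{\top} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.
\end{aligned}
```

The sine term contributes only the rank-one matrix $\boldsymbol{u}\boldsymbol{u}^{\top}$ to the Hessian. For $\boldsymbol{v} = (1, -1)^{\top}$ we have $\boldsymbol{u}^{\top}\boldsymbol{v} = 0$, so $\boldsymbol{v}^{\top}\nabla^2 f(\boldsymbol{z})\,\boldsymbol{v} = \boldsymbol{v}^{\top}\nabla^2 q(\boldsymbol{z})\,\boldsymbol{v} \geq 0$ at every point: the function has non-negative curvature along $\boldsymbol{v}$ everywhere, and any negative curvature can only appear along directions with a component in $\boldsymbol{u}$.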

Theorems & Definitions (23)

  • Definition 2.1
  • Proposition 2.2
  • Proposition 2.3
  • Proposition 2.4
  • Proposition 2.5
  • Example 2.6: Quadratic Functions
  • Example 2.7: Low-rank Vector Fields
  • Example 2.8: Partial Convexity
  • Example 2.9: Diagonal Neural Networks
  • Example 2.10: Matrix Factorizations
  • ...and 13 more