Level Set Teleportation: An Optimization Perspective

Aaron Mishkin, Alberto Bietti, Robert M. Gower

TL;DR

Level Set Teleportation introduces an optimization primitive that speeds up gradient descent by maximizing the gradient norm over the current level set. The authors provide convergence guarantees for convex functions under Hessian stability, propose a practical SQP-like teleportation solver that uses only Hessian-vector products, and report improved performance over standard GD and results competitive with truncated Newton on a suite of convex and non-convex problems. They also analyze approximate teleportation and run experiments on logistic regression, MLPs on MNIST, and UCI datasets, showing reliable speedups, especially when high-accuracy solutions are desired. The work connects theory and practice, offering an operator that can accelerate first-order methods in a range of settings.

Abstract

We study level set teleportation, an optimization routine which tries to accelerate gradient descent (GD) by maximizing the gradient norm over a level set of the objective. While teleportation intuitively speeds up GD via bigger steps, current work lacks convergence theory for convex functions, guarantees for solving the teleportation operator, and even clear empirical evidence showing this acceleration. We resolve these open questions. For convex functions satisfying Hessian stability, we prove that GD with teleportation obtains a combined sub-linear/linear convergence rate which is strictly faster than GD when the optimality gap is small. This is in sharp contrast to the standard (strongly) convex setting, where teleportation neither improves nor worsens convergence. To evaluate teleportation in practice, we develop a projected-gradient method requiring only Hessian-vector products. We use this to show that gradient methods with access to a teleportation oracle outperform their standard versions on a variety of problems. We also find that GD with teleportation is faster than truncated Newton methods, particularly for non-convex optimization.
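
As a hedged sketch, the teleportation step described in the abstract can be written as a constrained problem. The symbols $w_k$ (current iterate) and $w_k^+$ (teleported point) follow the figure captions; the $\tfrac{1}{2}$ scaling and the squared norm are conveniences assumed here and do not change the maximizer:

$$
w_k^+ \in \arg\max_{w} \; \tfrac{1}{2}\|\nabla f(w)\|_2^2 \quad \text{subject to} \quad f(w) = f(w_k),
\qquad
\nabla_w \Big( \tfrac{1}{2}\|\nabla f(w)\|_2^2 \Big) = \nabla^2 f(w)\, \nabla f(w).
$$

The second identity is the reason an ascent-type solver for this problem can be run with Hessian-vector products alone: the gradient of the objective is the Hessian applied to the vector $\nabla f(w)$, so the full Hessian never has to be formed.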

Paper Structure

This paper contains 18 sections, 35 theorems, 159 equations, 19 figures, 2 algorithms.

Key Result

Lemma 2.0

If $f$ is $L$-smooth and $\eta_k < 2/L$, then GD with tele-schedule $\mathcal{T}$ satisfies $\delta_{k+1} \leq \delta_k$, $w_k \in \mathcal{S}_0$, and $\|w_k - w^*\|_2 \leq R$ for every $k \in \mathbb{N}$.
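
The monotonicity claim follows from the standard descent inequality for $L$-smooth functions. The sketch below assumes that $\delta_k = f(w_k) - f(w^*)$, that $w_k^+$ denotes the teleported iterate (with $w_k^+ = w_k$ on iterations outside $\mathcal{T}$), and that the GD step is $w_{k+1} = w_k^+ - \eta_k \nabla f(w_k^+)$; these conventions are not stated in the lemma as excerpted here:

$$
f(w_{k+1}) \;\leq\; f(w_k^+) + \left\langle \nabla f(w_k^+),\, w_{k+1} - w_k^+ \right\rangle + \frac{L}{2}\|w_{k+1} - w_k^+\|_2^2
\;=\; f(w_k^+) - \eta_k\left(1 - \frac{L \eta_k}{2}\right)\|\nabla f(w_k^+)\|_2^2 .
$$

Level set teleportation keeps $f(w_k^+) = f(w_k)$, and $\eta_k < 2/L$ makes the coefficient $\eta_k(1 - L\eta_k/2)$ positive, so $f(w_{k+1}) \leq f(w_k)$ and hence $\delta_{k+1} \leq \delta_k$.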

Figures (19)

  • Figure 1: Initializing by level set teleportation on two test functions. The Booth function is a convex quadratic and teleportation aligns $w_0^+$ with the maximum eigenvalue-eigenvector pair. The next iteration of GD is equivalent to a Newton update and converges in one step. The Goldstein-Price function is non-convex and teleporting pushes $w_0^+$ up a narrow "valley" from which convergence is slow. Newton's method diverges on the non-convex function.
  • Figure 2: One iteration of our method for solving teleportation problems on a convex quadratic. The algorithm combines gradient ascent with projections onto the linearization $l_t = \left\{w : f(x_t) + \left\langle\nabla f(x_t), w - x_t\right\rangle = f(w_k)\right\}$.
  • Figure 3: Sub-level Set Teleportation
  • Figure 4: Performance profile comparing optimization methods with (solid lines) and without (dashed lines) sub-level set teleportation for training three-layer ReLU networks. Stochastic methods teleport once every $10$ epochs starting from epoch $5$, while deterministic methods teleport once every $50$ iterations starting from $k=5$. A problem is "solved" when $\left(f(w_k) - f(w^*)\right) / \left(f(w_0) - f(w^*)\right) \leq \tau$, where $f(w^*)$ is estimated separately and $\tau$ is a threshold. Performance is judged by comparing time to a fixed proportion of problems solved (see dashed line at 50%). Algorithms with intermittent teleportation uniformly dominate their standard counterparts.
  • Figure 5: Performance of optimizers with (solid) and without (dashed) teleportation on MNIST. We train an MLP with the softplus activation and one hidden layer of size $500$. All methods are run in batch mode. Teleportation significantly improves the convergence speed of all methods and does not affect generalization performance.
  • ...and 14 more figures
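
The captions of Figures 1 and 2 can be illustrated together on the Booth function. The following is a minimal sketch, not the authors' solver: it alternates a gradient-ascent step on $\tfrac{1}{2}\|\nabla f\|_2^2$ (computed with a Hessian-vector product) with a projection onto the linearized level set, then takes one GD step. The fixed ascent step `rho = 1 / L**2`, the iteration count, the starting point, the GD step size `1 / L`, and the names `teleport` and `hvp` are all assumptions made for this example.

```python
import numpy as np

# Booth function f(w) = (w0 + 2*w1 - 7)^2 + (2*w0 + w1 - 5)^2,
# written as ||B w - b||^2. Its minimizer is w* = (1, 3).
B = np.array([[1.0, 2.0], [2.0, 1.0]])
b = np.array([7.0, 5.0])


def f(w):
    r = B @ w - b
    return float(r @ r)


def grad(w):
    return 2.0 * B.T @ (B @ w - b)


def hvp(w, v):
    # Hessian-vector product. The Hessian 2 B^T B is constant for a
    # quadratic, so w is unused here; it is kept for a general signature.
    return 2.0 * B.T @ (B @ v)


def teleport(w_k, rho, n_iters=500):
    """Approximately maximize 0.5 * ||grad f(x)||^2 over {x : f(x) = f(w_k)}."""
    target = f(w_k)
    x = w_k.copy()
    for _ in range(n_iters):
        g = grad(x)
        # Ascent step: the gradient of 0.5 * ||grad f||^2 is H(x) @ g.
        z = x + rho * hvp(x, g)
        # Project z onto the hyperplane {w : f(x) + <g, w - x> = target}.
        residual = f(x) + g @ (z - x) - target
        x = z - (residual / (g @ g)) * g
    return x


L = 18.0  # largest eigenvalue of the Booth Hessian [[10, 8], [8, 10]]
w0 = np.array([4.0, -1.0])  # hypothetical starting point
w_plus = teleport(w0, rho=1.0 / L**2)

# One GD step with step size 1/L from the teleported point.
w_next = w_plus - (1.0 / L) * grad(w_plus)

print("f(w0), f(w_plus):", f(w0), f(w_plus))
print("gradient norms:", np.linalg.norm(grad(w0)), np.linalg.norm(grad(w_plus)))
print("w_next:", w_next)
```

On a convex quadratic the teleported point lies along the top eigenvector of the Hessian (measured from the minimizer), so the gradient there is $L$ times the displacement from $w^*$ and a step of size $1/L$ coincides with the Newton step. For a general objective the largest eigenvalue is not available in closed form, and the fixed step sizes used here would need to be replaced by a step-size rule such as a line search.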

Theorems & Definitions (59)

  • Lemma 2.0
  • Lemma 2.0
  • Proposition 2.0
  • Proposition 2.0
  • Proposition 2.0
  • Proposition 2.0
  • Definition 2.1
  • Lemma 2.1
  • Lemma 2.1
  • Theorem 2.2
  • ...and 49 more