Level Set Teleportation: An Optimization Perspective

Aaron Mishkin; Alberto Bietti; Robert M. Gower

TL;DR

Level Set Teleportation introduces an optimization primitive that speeds up gradient descent by maximizing the gradient norm over the current level set. The authors provide convergence guarantees for convex functions under Hessian stability, propose a practical SQP-like teleportation solver that uses only Hessian-vector products, and demonstrate improved performance over standard GD and competitive results with truncated Newton on a suite of convex and non-convex problems. They also analyze approximate teleportation and report experiments on logistic regression, MLPs on MNIST, and UCI datasets, showing speedups that are most pronounced when high-accuracy solutions are desired. The work bridges theory and practice, offering an operator that can accelerate first-order methods in a range of settings.

Abstract

We study level set teleportation, an optimization routine which tries to accelerate gradient descent (GD) by maximizing the gradient norm over a level set of the objective. While teleportation intuitively speeds up GD via bigger steps, current work lacks convergence theory for convex functions, guarantees for solving the teleportation operator, and even clear empirical evidence showing this acceleration. We resolve these open questions. For convex functions satisfying Hessian stability, we prove that GD with teleportation obtains a combined sub-linear/linear convergence rate which is strictly faster than GD when the optimality gap is small. This is in sharp contrast to the standard (strongly) convex setting, where teleportation neither improves nor worsens convergence. To evaluate teleportation in practice, we develop a projected-gradient method requiring only Hessian-vector products. We use this to show that gradient methods with access to a teleportation oracle outperform their standard versions on a variety of problems. We also find that GD with teleportation is faster than truncated Newton methods, particularly for non-convex optimization.
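
As a sketch of the operator described in the abstract, teleportation can be written as a constrained maximization over the level set through the current iterate. The notation is assumed here rather than taken from this page: $w_k$ is the current iterate, $w_k^+$ is the teleported point (following the $w_0^+$ of Figure 1), and the factor $\tfrac{1}{2}$ and the square are conveniences that do not change the maximizer.

$$w_k^+ \in \arg\max_{w} \; \tfrac{1}{2}\|\nabla f(w)\|_2^2 \quad \text{subject to} \quad f(w) = f(w_k), \qquad w_{k+1} = w_k^+ - \eta_k \nabla f(w_k^+).$$

The sub-level set variant named in Figure 3 presumably relaxes the equality constraint to $f(w) \leq f(w_k)$.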

Paper Structure (18 sections, 35 theorems, 159 equations, 19 figures, 2 algorithms)

INTRODUCTION
Additional Related Work
LEVEL SET TELEPORTATION
Convergence for Convex Functions
Convergence under Hessian Stability
EVALUATING THE TELEPORTATION OPERATOR
Step-sizes and Termination
Approximate Teleportation
EXPERIMENTS
CONCLUSION
LEVEL SET TELEPORTATION: PROOFS
Convergence for Convex Functions: Proofs
Convergence under Hessian Stability: Proofs
EVALUATING THE TELEPORTATION OPERATOR: PROOFS
EXPERIMENTS
...and 3 more sections

Key Result

Lemma 2.0

If $f$ is $L$-smooth and $\eta_k < 2/L$, then GD with tele-schedule $\mathcal{T}$ satisfies $\delta_{k+1} \leq \delta_k$, $w_k \in \mathcal{S}_0$, and $\|w_k - w^*\|_2 \leq R$ for every $k \in \mathbb{N}$.
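
The monotonicity claim can be traced to the standard descent lemma for $L$-smooth functions. The following is a sketch under assumptions not stated on this page: $\delta_k := f(w_k) - f(w^*)$, $\mathcal{S}_0$ is the initial sub-level set $\{w : f(w) \leq f(w_0)\}$, and teleportation does not increase the objective, so $f(w_k^+) \leq f(w_k)$ (with $w_k^+ = w_k$ on iterations that do not teleport).

$$f(w_{k+1}) = f\big(w_k^+ - \eta_k \nabla f(w_k^+)\big) \leq f(w_k^+) - \eta_k\left(1 - \tfrac{L \eta_k}{2}\right)\|\nabla f(w_k^+)\|_2^2 \leq f(w_k^+) \leq f(w_k).$$

The coefficient $\eta_k\left(1 - \tfrac{L \eta_k}{2}\right)$ is positive exactly when $0 < \eta_k < 2/L$. Subtracting $f(w^*)$ from both ends gives $\delta_{k+1} \leq \delta_k$, and by induction $f(w_k) \leq f(w_0)$, so every iterate stays in $\mathcal{S}_0$.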

Figures (19)

Figure 1: Initializing by level set teleportation on two test functions. The Booth function is a convex quadratic and teleportation aligns $w_0^+$ with the maximum eigenvalue-eigenvector pair. The next iteration of GD is equivalent to a Newton update and converges in one step. The Goldstein-Price function is non-convex and teleporting pushes $w_0^+$ up a narrow "valley" from which convergence is slow. Newton's method diverges on the non-convex function.
Figure 2: One iteration of our method for solving teleportation problems on a convex quadratic. The algorithm combines gradient ascent with projections onto the linearization $l_t = \left\{w : f(x_t) + \left\langle\nabla f(x_t), w - x_t\right\rangle = f(w_k)\right\}$.
Figure 3: Sub-level Set Teleportation
Figure 4: Performance profile comparing optimization methods with (solid lines) and without (dashed lines) sub-level set teleportation for training three-layer ReLU networks. Stochastic methods teleport once every $10$ epochs starting from epoch $5$, while deterministic methods teleport once every $50$ iterations starting from $k=5$. A problem is "solved" when $\left(f(w_k) - f(w^*)\right) / (f(w_0) - f(w^*)) \leq \tau$, where $f(w^*)$ is estimated separately and $\tau$ is a threshold. Performance is judged by comparing time to a fixed proportion of problems solved (see dashed line at 50%). Algorithms with intermittent teleportation uniformly dominate their standard counterparts.
Figure 5: Performance of optimizers with (solid) and without (dashed) teleportation on MNIST. We train an MLP with the soft-plus activation and one hidden layer of size $500$. All methods are run in batch mode. Teleportation significantly improves the convergence speed of all methods and does not affect generalization performance.
...and 14 more figures
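
The behavior described in the captions of Figures 1 and 2 can be checked numerically on the Booth function. The following is a minimal sketch, not the authors' solver: the fixed ascent step `rho`, the fixed iteration count, the starting point `(0, 0)`, and the finite-difference `hvp` (a stand-in for an automatic-differentiation Hessian-vector product) are all assumptions made for illustration, and the paper's step-size and termination rules are omitted. Each iteration takes a gradient ascent step on $\tfrac{1}{2}\|\nabla f\|_2^2$, whose gradient is the Hessian-vector product $\nabla^2 f(x_t) \nabla f(x_t)$, and then projects onto the linearization of the level set from Figure 2.

```python
import math

def booth(w):
    # Booth function: a convex quadratic with minimizer (1, 3).
    x, y = w
    return (x + 2 * y - 7) ** 2 + (2 * x + y - 5) ** 2

def booth_grad(w):
    x, y = w
    a = x + 2 * y - 7
    b = 2 * x + y - 5
    return (2 * a + 4 * b, 4 * a + 2 * b)

def hvp(grad, x, v, eps=1e-5):
    # Central-difference Hessian-vector product (assumed stand-in for autodiff).
    gp = grad((x[0] + eps * v[0], x[1] + eps * v[1]))
    gm = grad((x[0] - eps * v[0], x[1] - eps * v[1]))
    return ((gp[0] - gm[0]) / (2 * eps), (gp[1] - gm[1]) / (2 * eps))

def teleport(f, grad, w_k, rho=0.002, iters=1000):
    # Simplified teleportation loop: ascent step, then projection onto
    # {w : f(x) + <grad f(x), w - x> = f(w_k)}.
    target = f(w_k)
    x = w_k
    for _ in range(iters):
        g = grad(x)
        hg = hvp(grad, x, g)  # gradient of 0.5 * ||grad f(x)||^2
        v = (x[0] + rho * hg[0], x[1] + rho * hg[1])
        # Residual of the linearized level set constraint at v.
        r = f(x) + g[0] * (v[0] - x[0]) + g[1] * (v[1] - x[1]) - target
        s = r / (g[0] ** 2 + g[1] ** 2)
        x = (v[0] - s * g[0], v[1] - s * g[1])
    return x

def grad_norm(w):
    return math.hypot(*booth_grad(w))

w0 = (0.0, 0.0)
w_plus = teleport(booth, booth_grad, w0)

# One GD step from the teleported point with step-size 1/L.
# The Booth Hessian [[10, 8], [8, 10]] has eigenvalues 18 and 2, so L = 18.
L = 18.0
g_plus = booth_grad(w_plus)
w1 = (w_plus[0] - g_plus[0] / L, w_plus[1] - g_plus[1] / L)

print("f(w0), f(w_plus):", booth(w0), booth(w_plus))
print("gradient norms:", grad_norm(w0), grad_norm(w_plus))
print("after one GD step:", w1)
```

In this sketch the teleported point keeps the objective value of `w0` while its gradient norm grows, and the single GD step from `w_plus` should land close to the minimizer $(1, 3)$, consistent with the one-step convergence described for the Booth function in Figure 1.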

Theorems & Definitions (59)

Lemma 2.0
Lemma 2.0
Proposition 2.0
Proposition 2.0
Proposition 2.0
Proposition 2.0
Definition 2.1
Lemma 2.1
Lemma 2.1
Theorem 2.2
...and 49 more
