Teleportation With Null Space Gradient Projection for Optimization Acceleration

Zihao Wu, Juncheng Dong, Ahmed Aloui, Vahid Tarokh

TL;DR

The paper addresses slow convergence in non-convex deep learning optimization by extending teleportation with null-space gradient projection, using the teleportation objective $L_{Teleport} = \frac{1}{2}\|\nabla_W L_{primary}\|^2$. It introduces a projection operator that constrains teleportation updates to the input null space, which keeps the parameters within the loss-invariant level set and makes the method applicable across architectures. The approach generalizes from MLPs to CNNs and Transformers, and it reduces computational cost while keeping the approximation error controllable through the threshold parameter $\tau$. Experiments on MNIST, FashionMNIST, CIFAR, Tiny-Imagenet, electricity, traffic, and Penn Treebank show faster convergence and robust performance across optimizers such as SGD, Momentum, AdaGrad, and Adam.
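
To see why restricting updates to the input null space leaves the loss unchanged, consider a single linear layer. The notation below is an assumption made for illustration and is not taken from the paper: inputs are stacked row-wise in $X \in \mathbb{R}^{n \times d}$, the layer computes $XW^\top$, $P$ is a symmetric projector satisfying $XP = 0$, and $\eta$ is a step size.

$$
W' = W + \eta\,(\nabla_W L_{Teleport})\,P
\quad\Longrightarrow\quad
X W'^\top = X W^\top + \eta\,(XP)\,(\nabla_W L_{Teleport})^\top = X W^\top .
$$

Under these assumptions the layer output on $X$, and hence the primary loss on those inputs, is unchanged by the update. If $P$ only approximately annihilates $X$, the output changes by a correspondingly small amount, which is plausibly the error that the threshold $\tau$ controls.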

Abstract

Optimization techniques have become increasingly critical due to the ever-growing model complexity and data scale. In particular, teleportation has emerged as a promising approach, which accelerates convergence of gradient descent-based methods by navigating within the loss invariant level set to identify parameters with advantageous geometric properties. Existing teleportation algorithms have primarily demonstrated their effectiveness in optimizing Multi-Layer Perceptrons (MLPs), but their extension to more advanced architectures, such as Convolutional Neural Networks (CNNs) and Transformers, remains challenging. Moreover, they often impose significant computational demands, limiting their applicability to complex architectures. To this end, we introduce an algorithm that projects the gradient of the teleportation objective function onto the input null space, effectively preserving the teleportation within the loss invariant level set and reducing computational cost. Our approach is readily generalizable from MLPs to CNNs, transformers, and potentially other advanced architectures. We validate the effectiveness of our algorithm across various benchmark datasets and optimizers, demonstrating its broad applicability.
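
The projection step described in the abstract can be sketched numerically for a single linear layer. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `input_null_space_projector`, the relative singular-value threshold `tau`, the row-wise input layout, and the random stand-in `G` for the teleportation gradient are all hypothetical.

```python
import numpy as np

def input_null_space_projector(X, tau=1e-10):
    """Projector onto the (approximate) null space of the inputs X (n x d).

    Hypothetical helper: directions whose singular value is at most
    tau * (largest singular value) are treated as null directions.
    """
    # Rows of Vt are the right singular vectors of X (a basis of R^d).
    _, s, Vt = np.linalg.svd(X, full_matrices=True)
    # Pad so every right singular vector has a singular value (zero if missing).
    s_full = np.zeros(Vt.shape[0])
    s_full[: s.size] = s
    null_mask = s_full <= tau * s_full.max()
    V_null = Vt[null_mask].T            # d x k basis of the null space
    return V_null @ V_null.T            # d x d symmetric projector

rng = np.random.default_rng(0)
n, d_in, d_out = 8, 32, 4               # fewer samples than features -> nontrivial null space
X = rng.normal(size=(n, d_in))          # layer inputs, one sample per row (assumption)
W = rng.normal(size=(d_out, d_in))      # layer weights; output is X @ W.T
G = rng.normal(size=(d_out, d_in))      # stand-in for the teleportation-objective gradient

P = input_null_space_projector(X)
W_new = W + 0.5 * (G @ P)               # step restricted to the input null space

# The layer output on X should be unchanged up to floating-point error.
max_change = np.abs(X @ W_new.T - X @ W.T).max()
print(f"max output change: {max_change:.2e}")
```

In this sketch, raising `tau` enlarges the set of directions treated as null, which gives the update more freedom at the cost of a larger change in the layer output.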

Paper Structure

This paper contains 27 sections, 15 equations, 11 figures, 1 table, 1 algorithm.

Figures (11)

  • Figure 1: From left to right: symmetry teleport (slow and limited to MLPs), linear approximation of level set (prone to error), our algorithm that projects gradient onto the input null space (fast and accurate).
  • Figure 2: Loss trajectories of training MLPs on the MNIST and FashionMNIST datasets. Each experiment is repeated 3 times, with the average loss plotted and the standard deviation of loss represented as the shaded area.
  • Figure 3: From left to right: a comparison between symmetry teleport and our algorithm using MLPs in terms of the scaling of runtime with respect to $t$, $d$, $n$, $l$, and $b$.
  • Figure 4: Loss trajectories of training CNNs on the CIFAR and Tiny-Imagenet datasets. Each experiment is repeated 3 times, with the average loss plotted and the standard deviation of loss represented as the shaded area. Results for CIFAR100 are included in the appendix.
  • Figure 5: Loss trajectories of training Transformers on sequential MNIST, electricity, traffic, and Penn Treebank datasets. Each experiment is repeated 3 times, with the average loss plotted and the standard deviation of loss represented as the shaded area.
  • ...and 6 more figures