Teleportation With Null Space Gradient Projection for Optimization Acceleration
Zihao Wu, Juncheng Dong, Ahmed Aloui, Vahid Tarokh
TL;DR
The paper addresses slow convergence in non-convex deep learning optimization by extending teleportation with null-space gradient projection. It builds on the teleportation objective $L_{Teleport} = \frac{1}{2}\|\nabla_W L_{primary}\|^2$, half the squared norm of the primary-loss gradient. A projection operator constrains each teleportation update to the null space of the layer input, so the parameters stay within the loss-invariant level set and the method applies across architectures. The approach generalizes from MLPs to CNNs and Transformers and reduces computational cost, with the approximation error controlled by the threshold parameter $\tau$. Experiments on MNIST, FashionMNIST, CIFAR, Tiny-ImageNet, electricity, traffic, and Penn Treebank show faster convergence with the SGD, Momentum, AdaGrad, and Adam optimizers.
Abstract
Optimization techniques have become increasingly critical as model complexity and data scale continue to grow. In particular, teleportation has emerged as a promising approach: it accelerates the convergence of gradient-descent-based methods by navigating within the loss-invariant level set to identify parameters with advantageous geometric properties. Existing teleportation algorithms have primarily demonstrated their effectiveness in optimizing Multi-Layer Perceptrons (MLPs), but their extension to more advanced architectures, such as Convolutional Neural Networks (CNNs) and Transformers, remains challenging. Moreover, they often impose significant computational demands, limiting their applicability to complex architectures. To this end, we introduce an algorithm that projects the gradient of the teleportation objective function onto the input null space, which keeps the teleported parameters within the loss-invariant level set and reduces computational cost. Our approach is readily generalizable from MLPs to CNNs, Transformers, and potentially other advanced architectures. We validate the effectiveness of our algorithm across various benchmark datasets and optimizers, demonstrating its broad applicability.
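
As an illustration of the projection step, the following is a minimal NumPy sketch for a single linear layer, not the authors' implementation. It rests on several assumptions that do not come from the text above: the helper name `null_space_projector` is hypothetical, $\tau$ is interpreted as a cumulative-energy threshold on the singular values of the layer input, and the matrix `G` is a random stand-in for the gradient of the teleportation objective, which the sketch does not compute. It only shows that an update projected onto the input null space leaves the layer output, and therefore the loss on those inputs, unchanged.

```python
import numpy as np

def null_space_projector(X, tau=0.999999):
    """Return P projecting onto the (approximate) null space of the inputs.

    X has shape (d_in, n): one column per input sample.
    tau is treated here as a cumulative-energy threshold (an assumption):
    the leading singular directions holding a fraction tau of the energy
    are considered the input row space and are projected out.
    """
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    energy = np.cumsum(S**2) / np.sum(S**2)
    # Smallest k whose leading k directions reach the energy threshold.
    k = min(int(np.searchsorted(energy, tau)) + 1, len(S))
    U_k = U[:, :k]
    return np.eye(X.shape[0]) - U_k @ U_k.T

rng = np.random.default_rng(0)
d_in, d_out, n, rank = 8, 4, 20, 3

# Rank-deficient inputs, so a non-trivial null space exists.
X = rng.standard_normal((d_in, rank)) @ rng.standard_normal((rank, n))
W = rng.standard_normal((d_out, d_in))

# Stand-in for the teleportation-objective gradient (hypothetical values).
G = rng.standard_normal((d_out, d_in))

P = null_space_projector(X)
eta = 0.5
W_new = W + eta * G @ P  # update restricted to the input null space

# The layer output on X is preserved up to numerical error.
output_change = np.max(np.abs(W_new @ X - W @ X))
```

Because `P @ X` is numerically zero, `W_new @ X` equals `W @ X` even though `W_new` differs from `W`. Under this reading, lowering `tau` enlarges the subspace available for the update at the price of a larger change in the layer output.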
