No learning rates needed: Introducing SALSA -- Stable Armijo Line Search Adaptation

Philip Kenneweg; Tristan Kenneweg; Fabian Fumagalli; Barbara Hammer

TL;DR

SaLSa removes the need for hand-tuned learning rates by augmenting Armijo line search with momentum-based smoothing and an adaptive cadence, which enables stable stochastic optimization without learning-rate schedules. It extends stochastic line search (SLS) to Adam and applies exponential smoothing to both the loss decrease and the gradient magnitude. It also provides a convergence analysis and a practical, low-overhead cadence control $L_k$ based on two exponential moving averages. Empirically, SaLSa outperforms prior SLS and tuned optimizers across NLP transformers and image CNNs (e.g., CIFAR/ImageNet): at roughly 3% compute overhead it yields about a 1.5% gain in accuracy and a 50% reduction in final log loss, while improving training stability on large-scale tasks. The work provides a public MIT-licensed PyTorch optimizer, making automatic step-size selection accessible for large-scale deep learning applications.
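
For orientation, the classical Armijo condition and one plausible smoothed form of it can be written as below. The symbols $\eta_k$ (step size), $c$ (sufficient-decrease constant), $\beta$ (smoothing factor), $h_k$ and $s_k$ (smoothed loss decrease and smoothed squared gradient norm) are notation assumed here for illustration, and a plain gradient step is assumed as the update direction; the paper's exact definitions may differ.

$$
f_k\big(w_k - \eta_k \nabla f_k(w_k)\big) \;\le\; f_k(w_k) - c\,\eta_k\,\|\nabla f_k(w_k)\|^2
$$

$$
h_k = \beta\,h_{k-1} + (1-\beta)\Big(f_k(w_k) - f_k\big(w_k - \eta_k \nabla f_k(w_k)\big)\Big), \qquad
s_k = \beta\,s_{k-1} + (1-\beta)\,\|\nabla f_k(w_k)\|^2, \qquad
h_k \;\ge\; c\,\eta_k\,s_k
$$

With $\beta = 0$ the smoothed test reduces to the classical condition on the current mini-batch.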

Abstract

In recent studies, line search methods have been demonstrated to significantly enhance the performance of conventional stochastic gradient descent techniques across various datasets and architectures, while making an otherwise critical choice of learning rate schedule superfluous. In this paper, we identify problems of current state-of-the-art line search methods, propose enhancements, and rigorously assess their effectiveness. Furthermore, we evaluate these methods on datasets that are orders of magnitude larger, and on more complex data domains, than previously done. More specifically, we enhance the Armijo line search method by speeding up its computation and incorporating a momentum term into the Armijo criterion, making it better suited for stochastic mini-batching. Our optimization approach outperforms both the previous Armijo implementation and a tuned learning rate schedule for the Adam and SGD optimizers. Our evaluation covers a diverse range of architectures, such as Transformers, CNNs, and MLPs, as well as data domains, including NLP and image data. Our work is publicly available as a Python package, which provides a simple PyTorch optimizer.
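
To make the idea of a momentum term in the Armijo criterion concrete, here is a minimal pure-Python sketch on a one-dimensional toy problem. It is not the released package: the function names (`loss_and_grad`, `smoothed_armijo_step`, `run_demo`), the constants (`c`, `beta`, `shrink`, `grow`), and the exact form of the smoothed test are assumptions chosen for illustration, and the toy uses a plain gradient step rather than Adam.

```python
import random


def loss_and_grad(w, batch):
    """Toy mini-batch loss: mean of 0.5 * (w - t)^2 over the targets t in the batch."""
    n = len(batch)
    loss = sum(0.5 * (w - t) ** 2 for t in batch) / n
    grad = sum(w - t for t in batch) / n
    return loss, grad


def smoothed_armijo_step(w, batch, state, c=0.3, beta=0.9,
                         shrink=0.5, grow=1.1, max_backtracks=20):
    """One gradient step whose size is chosen by backtracking on a smoothed Armijo test.

    Assumed (hypothetical) smoothed criterion:
        h = beta * h_prev + (1 - beta) * (loss decrease on this batch)
        s = beta * s_prev + (1 - beta) * (squared gradient norm)
        accept eta when h >= c * eta * s
    """
    loss, grad = loss_and_grad(w, batch)
    s = beta * state["s"] + (1 - beta) * grad * grad
    eta = state["eta"] * grow  # start slightly above the last accepted step size
    for _ in range(max_backtracks):
        new_loss, _ = loss_and_grad(w - eta * grad, batch)
        h = beta * state["h"] + (1 - beta) * (loss - new_loss)
        if h >= c * eta * s:
            break
        eta *= shrink  # backtrack: try a smaller step
    state.update(eta=eta, h=h, s=s)
    return w - eta * grad


def run_demo(steps=200, seed=0):
    """Fit w to noisy targets centered at 3.0 without specifying a learning rate schedule."""
    rng = random.Random(seed)
    state = {"eta": 1.0, "h": 0.0, "s": 0.0}
    w = 0.0
    for _ in range(steps):
        batch = [3.0 + rng.gauss(0.0, 0.1) for _ in range(8)]
        w = smoothed_armijo_step(w, batch, state)
    return w, state


if __name__ == "__main__":
    w, state = run_demo()
    print("final w:", w, "final step size:", state["eta"])
```

Setting `beta=0.0` in this sketch recovers a classical per-batch Armijo backtracking search, which makes it easy to compare how much the step size fluctuates with and without smoothing.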

Paper Structure (20 sections, 1 theorem, 17 equations, 6 figures, 2 tables, 1 algorithm)

Introduction
Background
Armijo Line Search
Including preconditioned Optimizers (Adam)
SLS Failure Cases
Methods
Addressing Mini-batch Noise
Intuitive Motivation
Theoretical Analysis
Addressing Computational Costs
Practical Considerations
Experimental Approach
Candidates
Datasets and Models
Implementation Details
...and 5 more sections

Key Result

Theorem 1

Let $f \equiv f_k$. For $w_0 \in \mathbb{R}^d$ let $S(w_0) := \{w \mid f(w) \leq f(w_0)\}$ and assume that $f(w^*) := \inf_{w \in \mathbb{R}^d} f(w)$ exists for a unique point $w^* \in \mathbb{R}^d$ with $\nabla f(w) = 0$ for $w \in S(w_0)$ if and only if $w = w^*$. Any sequence $\{w_{k}\}_{k=1}^{\infty}$ ...

Figures (6)

Figure 1: The step size of ADAM + SLS as well as ADAM + SaLSa on ImageNet. Colored areas indicate variance between runs. Notice the large variations for ADAM + SLS compared to the consistent and stable behavior of ADAM + SaLSa.
Figure 2: The step size of ADAM + SLS (black) compared to ADAM + SaLSa (brown) visualized during a training run of BERT on the MNLI dataset.
Figure 3: The step size of ADAM + SLS (black) on the CIFAR10 dataset. We sometimes observed drastic drops in step size due to computational precision problems. When training with SaLSa (red) we do not observe any such drops.
Figure 4: Wall-clock run times on various datasets using the speed-up described in the paper's section on computational costs.
Figure 5: The top row depicts the loss curves, while the bottom row depicts the accuracy curves from experiments conducted on the GLUE dataset. Standard errors are represented around each curve with a shaded area. Accuracy is computed on the validation data; loss is computed on the training data.
...and 1 more figures

Theorems & Definitions (2)

Theorem 1: Convergence Theorem
proof
