SANIA: Polyak-type Optimization Framework Leads to Scale Invariant Stochastic Algorithms

Farshed Abdukhakimov, Chulu Xiang, Dmitry Kamzolov, Robert Gower, Martin Takáč

TL;DR

The paper addresses two problems with adaptive optimizers: their learning rates still have to be tuned by hand, and they struggle on ill-conditioned or poorly scaled training problems. It introduces SANIA, a general, parameter-free preconditioned Polyak framework that unifies first- and second-order methods and yields scale- and affine-invariant variants. Key contributions include the first stochastic Cubic Newton method with Polyak step-size, new scale-invariant AdaGrad-SQR and Adam-SQR variants, and SANIA PCG for Newton in convex and non-convex settings, supported by affine- and scale-invariance proofs and experiments on convex and non-convex tasks. The approach promises robust, tuning-free optimization under changes of basis and rescaling of the data, with practical impact for deep learning and generalized linear models.
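To make the phrase "preconditioned Polyak framework" concrete, here is a minimal sketch of a generic Polyak step with a diagonal preconditioner. It is an illustration only and not necessarily the paper's exact update: the function name `preconditioned_polyak_step`, the diagonal form of the preconditioner, the `eps` safeguard, and the assumption that the optimal loss value is known (here `loss_star = 0`) are all choices made for this example.

```python
import numpy as np

def preconditioned_polyak_step(w, loss, grad, precond_diag, loss_star=0.0, eps=1e-12):
    """One Polyak-type step with a diagonal preconditioner B = diag(precond_diag).

    Assumption for this sketch: the optimal value loss_star is known
    (for example 0 under interpolation).
    """
    g = grad(w)
    direction = g / precond_diag      # B^{-1} g for a diagonal B
    denom = g @ direction             # squared norm of g measured in B^{-1}
    # Polyak step-size: no learning rate to tune, only loss_star is needed.
    step = max(loss(w) - loss_star, 0.0) / (denom + eps)
    return w - step * direction

# Badly scaled quadratic: f(w) = 0.5 * sum(a_i * w_i^2), minimum value 0.
a = np.array([1.0, 100.0])
loss = lambda w: 0.5 * np.sum(a * w * w)
grad = lambda w: a * w

w0 = np.array([1.0, 1.0])
# Using the exact Hessian diagonal as the preconditioner.
w1 = preconditioned_polyak_step(w0, loss, grad, precond_diag=a)
```

On this toy quadratic, preconditioning with the exact Hessian diagonal moves both coordinates by the same relative amount regardless of the factor-100 difference in curvature, which is the kind of scale insensitivity the summary refers to.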

Abstract

Adaptive optimization methods are widely recognized as among the most popular approaches for training Deep Neural Networks (DNNs). Techniques such as Adam, AdaGrad, and AdaHessian utilize a preconditioner that modifies the search direction by incorporating information about the curvature of the objective function. However, despite their adaptive characteristics, these methods still require manual fine-tuning of the step-size. This, in turn, impacts the time required to solve a particular problem. This paper presents an optimization framework named SANIA to tackle these challenges. Beyond eliminating the need for manual step-size hyperparameter settings, SANIA incorporates techniques to address poorly scaled or ill-conditioned problems. We also explore several preconditioning methods, including Hutchinson's method, which approximates the Hessian diagonal of the loss function. We conclude with an extensive empirical examination of the proposed techniques across classification tasks, covering both convex and non-convex contexts.
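The abstract mentions Hutchinson's method for approximating the Hessian diagonal. A minimal sketch of the standard estimator, $\mathrm{diag}(H) \approx \mathbb{E}[z \odot (Hz)]$ with Rademacher vectors $z$, is shown below. The function name `hutchinson_diagonal`, the `hvp` callback, and the explicit test matrix are assumptions for illustration. In practice the Hessian-vector product would come from automatic differentiation instead of a stored matrix.

```python
import numpy as np

def hutchinson_diagonal(hvp, dim, num_samples=100, rng=None):
    """Estimate diag(H) using only Hessian-vector products hvp(v) = H @ v."""
    rng = np.random.default_rng() if rng is None else rng
    estimate = np.zeros(dim)
    for _ in range(num_samples):
        # Rademacher probe: entries are +1 or -1 with equal probability.
        z = rng.choice([-1.0, 1.0], size=dim)
        # z * (H z) has expectation diag(H), since E[z_i z_j] = 0 for i != j.
        estimate += z * hvp(z)
    return estimate / num_samples

# Toy symmetric matrix standing in for a Hessian (illustrative values only).
H = np.array([[2.0, 0.5, 0.0],
              [0.5, 3.0, 0.1],
              [0.0, 0.1, 1.0]])
diag_estimate = hutchinson_diagonal(lambda v: H @ v, dim=3,
                                    num_samples=20000,
                                    rng=np.random.default_rng(0))
```

The estimator never forms the Hessian, so its cost per sample is one Hessian-vector product. Its error shrinks with the number of probes and grows with the size of the off-diagonal entries.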

Paper Structure

This paper contains 40 sections, 9 theorems, 85 equations, 9 figures, 2 tables, 4 algorithms.

Key Result

Lemma 1

The solution $\bar{w}$ of the next problem is the same as the solution $\hat{w}$ of where $\tau>0$.

Figures (9)

  • Figure 1: Accuracy of the models optimized by SANIA and other popular algorithms with different learning rates $lr \in [2^{-20}, 2^{5}]$ after $10$ epochs of training on colon-cancer with batch-size$=16$ and logistic regression criterion.
  • Figure 2: Comparison of classical and modified Adam and AdaGrad preconditioners used in SANIA optimizing logistic regression for synthetically generated dataset and Mushrooms (scaling factor $k=2$).
  • Figure 3: Observation of scale invariance of SANIA and performance of Adam on 2 datasets with the scaling factor $k=2$ and logistic regression objective function.
  • Figure 4: Performance of SANIA and other adaptive methods on 4 datasets (original and badly scaled with scaling factor $k=6$) with logistic regression loss.
  • Figure 5: Performance of SANIA and other adaptive methods on 2 LibSVM datasets (original and badly scaled with scaling factor $k=6$) with non-linear least squares loss.
  • ...and 4 more figures

Theorems & Definitions (10)

  • Definition 1
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • Lemma 9