A second-order-like optimizer with adaptive gradient scaling for deep learning

Jérôme Bolte; Ryan Boustany; Edouard Pauwels; Andrei Purica

TL;DR

The paper introduces INNAprop, a second-order-like optimizer that combines the dynamical inertial Newton framework (INNA) with RMSprop-style adaptive gradient scaling. Its memory footprint is comparable to that of AdamW, and it exploits second-order information through time derivatives rather than Hessian computations. In experiments on CIFAR-10, Food101, and ImageNet (including ViT models), GPT-2 pre-training, and LoRA fine-tuning, INNAprop matches or surpasses AdamW in training speed and accuracy with minimal hyperparameter tuning, across both vision and language tasks. A continuous-time interpretation complements the practical discretizations, and the authors release public code to support adoption and further development in large-scale deep learning training.
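
The idea of using second-order information "through time derivatives" can be unpacked with a chain-rule identity. The sketch below assumes the dynamical inertial Newton system in the form commonly given in the literature, with $\mathcal{J}$ the loss, $\theta(t)$ the parameter trajectory, and $\alpha, \beta$ the two hyperparameters; the notation is an assumption, since this summary does not state the equations.

```latex
% Assumed standard form of the dynamical inertial Newton system
% (\alpha \ge 0 damping, \beta > 0 weight of the Hessian-driven term).
\ddot{\theta}(t) + \alpha\,\dot{\theta}(t)
  + \beta\,\nabla^2 \mathcal{J}(\theta(t))\,\dot{\theta}(t)
  + \nabla \mathcal{J}(\theta(t)) = 0

% Chain rule: the Hessian-driven term is the time derivative of the gradient
% along the trajectory.
\frac{d}{dt}\,\nabla \mathcal{J}(\theta(t))
  = \nabla^2 \mathcal{J}(\theta(t))\,\dot{\theta}(t)
```

Under these assumptions, the Hessian only enters through the time derivative of the gradient along the trajectory, which a discretization can approximate from successive gradient evaluations. This is one way to read the claim that no Hessian computation is needed.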

Abstract

In this empirical article, we introduce INNAprop, an optimization algorithm that combines the INNA method with the RMSprop adaptive gradient scaling. It leverages second-order information and rescaling while keeping the memory requirements of standard DL methods such as AdamW or SGD with momentum. After giving geometrical insights, we evaluate INNAprop on CIFAR-10, Food101, and ImageNet with ResNets, VGG, DenseNet, and ViT, and on GPT-2 (OpenWebText), both trained from scratch and with LoRA fine-tuning (E2E). INNAprop consistently matches or outperforms AdamW both in training speed and accuracy, with minimal hyperparameter tuning in large-scale settings. Our code is publicly available at https://github.com/innaprop/innaprop.
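
To make "RMSprop adaptive gradient scaling" concrete, here is a minimal sketch of the generic RMSprop rescaling step in plain Python. It is not the INNAprop update, which this summary does not spell out. The function name `rmsprop_scaled_step` and the parameters `sigma` (decay rate) and `eps` (numerical safeguard) are illustrative choices, not taken from the paper or its code.

```python
import math


def rmsprop_scaled_step(grad, v, sigma=0.999, eps=1e-8):
    """One generic RMSprop-style rescaling step (illustrative, not INNAprop).

    grad : list of floats, current gradient
    v    : list of floats, running average of squared gradients
    Returns (scaled_grad, new_v).
    """
    # Exponential moving average of the squared gradient, per coordinate.
    new_v = [sigma * vi + (1.0 - sigma) * gi * gi for vi, gi in zip(v, grad)]
    # Divide each gradient coordinate by the root of its running average.
    scaled = [gi / (math.sqrt(vi) + eps) for gi, vi in zip(grad, new_v)]
    return scaled, new_v


# Two coordinates whose raw gradients differ by four orders of magnitude.
grad = [100.0, 0.01]
v = [0.0, 0.0]
for _ in range(5):
    scaled, v = rmsprop_scaled_step(grad, v, sigma=0.9)
print(scaled)
```

After rescaling, both coordinates have nearly the same magnitude even though the raw gradients differ widely. The only extra state is one running average per parameter.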

Paper Structure (49 sections, 45 equations, 19 figures, 5 tables, 4 algorithms)

Introduction
Continuous dynamical systems as optimization models.
Adaptive methods.
Our approach.
Relation with existing work.
Contributions.
INNAprop: a second-order method in space and time based on RMSProp
The algorithm
Derivation of the algorithm
Empirical evaluation of INNAprop
Tuning INNAprop on CIFAR-10 with VGG11 and ResNet18
Hyperparameter tuning:
Validation and comparison with AdamW:
Extensive experiments on large-scale vision models
Resnets on ImageNet:
...and 34 more sections

Figures (19)

Figure 1: Log-scale training loss and test accuracies for hyperparameters $(\alpha, \beta)$ with VGG11 on CIFAR10 at 20 and 200 epochs. Optimal learning rate $\gamma_0 = 10^{-3}$ and weight decay $\lambda = 0.01$, with one random seed.
Figure 2: Training VGG11 on CIFAR10. Left: train loss, middle: test accuracy (%), right: train accuracy (%), with 8 random seeds.
Figure 3: Training a ResNet50 (top) and ViT-B/32 (bottom) on ImageNet. Left: train loss, middle: Top-1 test accuracy (%), right: Top-1 train accuracy (%). 3 random seeds.
Figure 4: Finetuning a VGG11 on Food101. Left: train loss, middle: test accuracy (%), right: train accuracy (%). Qualitatively similar results for ResNet18 are reported in the additional experiments appendix. 3 random seeds.
Figure 5: GPT-2 training from scratch on OpenWebText.
...and 14 more figures

Theorems & Definitions (6)

Remark 1: Well posedness
Remark 2: On other possible discretizations
Remark 3: A family of algorithms indexed by $\alpha,\beta$
Remark 4: Trade-off between fast learning and good generalization
Remark 5
Remark 6
