A second-order-like optimizer with adaptive gradient scaling for deep learning
Jérôme Bolte, Ryan Boustany, Edouard Pauwels, Andrei Purica
TL;DR
The paper introduces INNAprop, a second-order-like optimizer that blends the dynamical inertial Newton framework with RMSprop-style adaptive gradient scaling. It keeps a memory footprint comparable to AdamW while exploiting second-order information through time derivatives of the gradient, so no Hessian computations are needed. In experiments on CIFAR-10, Food101, and ImageNet (with architectures including ViT), on GPT-2 pre-training, and on LoRA fine-tuning, INNAprop matches or surpasses AdamW with minimal hyperparameter tuning, across both vision and language tasks. A continuous-time interpretation complements the practical discretizations, and the authors release public code to ease adoption and further development in large-scale deep learning training.
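The phrase "second-order information through time derivatives" can be made concrete with the dynamical inertial Newton (DIN) system that underlies INNA. The following is a sketch of that continuous-time model only; the symbols $J$ (loss), $\theta(t)$ (parameter trajectory), and $\alpha, \beta > 0$ (damping parameters) are notation assumed here, and the RMSprop rescaling that INNAprop adds is not shown.

```latex
\ddot{\theta}(t) + \alpha\,\dot{\theta}(t)
  + \beta\,\nabla^2 J(\theta(t))\,\dot{\theta}(t)
  + \nabla J(\theta(t)) = 0,
\qquad
\nabla^2 J(\theta(t))\,\dot{\theta}(t) = \frac{d}{dt}\,\nabla J(\theta(t)).
```

The Hessian enters only through the product $\nabla^2 J(\theta)\,\dot{\theta}$, which by the chain rule equals the time derivative of the gradient along the trajectory. A discretization can therefore work with gradients alone and never needs to form the Hessian.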
Abstract
In this empirical article, we introduce INNAprop, an optimization algorithm that combines the INNA method with RMSprop adaptive gradient scaling. It leverages second-order information and rescaling while keeping the memory requirements of standard deep learning methods such as AdamW or SGD with momentum. After giving geometrical insights, we evaluate INNAprop on CIFAR-10, Food101, and ImageNet with ResNets, VGG, DenseNet, and ViT, and on GPT-2 (OpenWebText), both trained from scratch and with LoRA fine-tuning (E2E). INNAprop consistently matches or outperforms AdamW in both training speed and accuracy, with minimal hyperparameter tuning in large-scale settings. Our code is publicly available at https://github.com/innaprop/innaprop.
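To illustrate how an inertial Newton update and RMSprop scaling might fit together, here is a minimal sketch in plain Python. It is not the authors' implementation: it assumes the two-variable explicit Euler discretization of the inertial Newton dynamics, a standard RMSprop second-moment estimate, and an Adam-style bias correction. The function names, the hyperparameter values, and the toy quadratic loss are all hypothetical; the released code is the reference for the actual update rule.

```python
import math

def innaprop_like_step(theta, psi, v, grad, step, lr=0.01, alpha=0.5,
                       beta=1.0, sigma=0.99, eps=1e-8):
    """One illustrative update on lists of floats; returns (theta, psi, v)."""
    new_theta, new_psi, new_v = [], [], []
    bias = 1.0 - sigma ** step  # Adam-style bias correction (an assumption)
    for t, p, vi, g in zip(theta, psi, v, grad):
        vi = sigma * vi + (1.0 - sigma) * g * g      # RMSprop second-moment estimate
        g_hat = g / (math.sqrt(vi / bias) + eps)     # adaptively scaled gradient
        drift = (1.0 / beta - alpha) * t - p / beta  # inertial term shared by both updates
        new_theta.append(t + lr * (drift - beta * g_hat))
        new_psi.append(p + lr * drift)
        new_v.append(vi)
    return new_theta, new_psi, new_v

def loss(theta):
    """Toy quadratic loss J(theta) = 0.5 * ||theta||^2 (hypothetical example)."""
    return 0.5 * sum(t * t for t in theta)

def run(theta0, steps=2000, alpha=0.5, beta=1.0):
    """Run the sketch on the toy quadratic and return the final parameters."""
    theta = list(theta0)
    # This choice of psi makes the initial drift term exactly zero.
    psi = [(1.0 - alpha * beta) * t for t in theta]
    v = [0.0] * len(theta)
    for k in range(1, steps + 1):
        grad = list(theta)  # gradient of the toy quadratic is theta itself
        theta, psi, v = innaprop_like_step(theta, psi, v, grad, k,
                                           alpha=alpha, beta=beta)
    return theta

if __name__ == "__main__":
    start = [1.0, -2.0]
    end = run(start)
    print("initial loss:", loss(start), "final loss:", loss(end))
```

In this sketch the optimizer state consists of two buffers per parameter (`psi` and `v`), the same count as AdamW's two moment buffers, which is consistent with the memory claim above. No Hessian appears anywhere; only gradients are used.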
