Memory-Efficient Optimization with Factorized Hamiltonian Descent

Son Nguyen, Lizhang Chen, Bo Liu, Qiang Liu

TL;DR

H-Fac is a novel adaptive optimizer that uses a memory-efficient factorization approach to address the high memory overhead of training large-scale network models.

Abstract

Modern deep learning heavily depends on adaptive optimizers such as Adam and its variants, which are renowned for their capacity to handle model scaling and streamline hyperparameter tuning. However, these algorithms typically experience high memory overhead caused by the accumulation of optimization states, leading to a critical challenge in training large-scale network models. In this study, we introduce a novel adaptive optimizer, H-Fac, which incorporates a memory-efficient factorization approach to address this challenge. By employing a rank-1 parameterization for both momentum and scaling parameter estimators, H-Fac reduces memory costs to a sublinear level while maintaining competitive performance across a wide range of architectures. We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings in optimization dynamics and convergence guarantees. These optimization algorithms are designed to be both straightforward and adaptable, facilitating easy implementation in diverse settings.
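
To make the "sublinear memory" claim concrete, the following is a minimal sketch of a rank-1 factorization of a non-negative statistics matrix (for example, squared gradients) using row and column sums. This is the generic Adafactor-style construction, shown here only as an illustration; it is not H-Fac's update rule, it does not cover the signed momentum estimator, and the function names are hypothetical.

```python
def factorize_rank1(v):
    """Compress a non-negative m x n matrix into m row sums and n column sums."""
    row_sums = [sum(row) for row in v]
    col_sums = [sum(col) for col in zip(*v)]
    return row_sums, col_sums


def reconstruct_rank1(row_sums, col_sums):
    """Rebuild the rank-1 estimate r c^T / sum(r) from the stored sums."""
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]


def state_size(m, n):
    """Number of stored entries for a full m x n state versus a factored one."""
    return {"full": m * n, "factored": m + n}


# A rank-1 matrix is recovered exactly; general matrices are only approximated.
v = [[1.0, 2.0], [2.0, 4.0]]
r, c = factorize_rank1(v)
v_hat = reconstruct_rank1(r, c)
sizes = state_size(4096, 4096)
```

For a 4096 x 4096 weight matrix, the factored state stores 8,192 numbers instead of 16,777,216, which is the sense in which the memory cost grows sublinearly in the number of parameters.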

Paper Structure

This paper contains 25 sections, 1 theorem, 37 equations, 5 figures, 4 tables, 5 algorithms.

Key Result

Proposition 1

A key property is that the function $\mathcal{H}$ monotonically decreases along the ODE trajectory, that is, $\dfrac{\mathrm{d}}{\mathrm{d}t} \mathcal{H}(\bm{\mathrm{W}}_t, \bm{\mathrm{M}}_t, \bm{\mathrm{r}}_t,\bm{\mathrm{s}}_t) \leq 0$.

Figures (5)

  • Figure 1: A comparison of optimizer performance on ResNet architectures. For signSGD, $\beta$ denotes the momentum coefficient. For signFSGD, "ablation" means the version without corrected terms, and "fullhead" means the version using full momentum for the MLP head layer.
  • Figure 2: Histograms illustrating the gradients of MLP head layers in ResNet50 (top) and ResNet101 (bottom) trained on CIFAR10 and CIFAR100, respectively.
  • Figure 3: Top-1 accuracy of optimizers when training ResNet50, ViT-B/32, and ViT-S/16 from scratch on ImageNet1K. For H-Fac, "ablation" means the version without corrected terms.
  • Figure 4: Training progression for pre-training LLaMA models on the C4 dataset. Lower is better.
  • Figure 5: Performance of sign-based optimizers on ResNet architectures. "fullhead" means the version using full momentum for the MLP head layer.

Theorems & Definitions (1)

  • Proposition 1