Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension

Wenbo Gong, Meyer Scetbon, Chao Ma, Edward Meeds

TL;DR

This work reframes optimizer design for large language models as structured Fisher information matrix (FIM) approximation, linking existing optimizers to specific structural assumptions and proposing two design strategies. The first selects a structure that balances generality and memory efficiency and generalizes gradient normalization, yielding Row and Column Scaled SGD (RACS). The second uses a novel low-rank extension framework to convert optimizers with more general structures into memory-efficient variants, exemplified by Alice. On LLaMA pre-training (C4), Alice converges more than 2x faster than Adam with little memory overhead, and RACS delivers strong performance on the 1B model with SGD-like memory; both outperform several memory-efficient baselines across multiple model sizes. The framework offers a principled way to design efficient optimizers, enabling systematic exploration of new structures and low-rank schemes that reduce memory footprint and speed up convergence in LLM training.

Abstract

Designing efficient optimizers for large language models (LLMs) with low memory requirements and fast convergence is an important and challenging problem. This paper takes a step towards the systematic design of such optimizers through the lens of structured Fisher information matrix (FIM) approximation. We show that many state-of-the-art efficient optimizers can be viewed as solutions to FIM approximation (under the Frobenius norm) with specific structural assumptions. Building on these insights, we propose two design recommendations for practical efficient optimizers for LLMs: carefully selecting structural assumptions to balance generality and efficiency, and enhancing the memory efficiency of optimizers with general structures through a novel low-rank extension framework. We demonstrate how to use each design approach by deriving new memory-efficient optimizers: Row and Column Scaled SGD (RACS) and Adaptive low-dimensional subspace estimation (Alice). Experiments on LLaMA pre-training (up to 1B parameters) validate their effectiveness, showing faster and better convergence than existing memory-efficient baselines and Adam with little memory overhead. Notably, Alice achieves better than 2x faster convergence over Adam, while RACS delivers strong performance on the 1B model with SGD-like memory.

Paper Structure

This paper contains 103 sections, 15 theorems, 126 equations, 6 figures, 11 tables, 9 algorithms.

Key Result

Proposition 1

Assuming $\mathcal{H} =\{\mathop{\mathrm{Diag_v}}\nolimits({\bm{v}}); v_i>0\}$, the FIM approximation problem (eq: UFE equation) has the analytic solution $\tilde{\bm{F}}^{*} = \mathop{\mathrm{Diag_v}}\nolimits(\mathbb{E}[\vec{\bm{g}}^2])$, where $\vec{\bm{g}}^2$ denotes the element-wise square of $\vec{\bm{g}}=\mathop{\mathrm{Vec}}\nolimits({\bm{G}})$.
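A minimal numerical sketch of what Proposition 1 implies is given below. It assumes that the objective is the squared Frobenius distance between a diagonal matrix and an empirical FIM formed as the average of outer products of vectorized gradients; the summary does not print the equation, so this form, the synthetic gradients, and all function names are illustrative assumptions rather than the authors' code.

```python
import numpy as np


def empirical_fim(grads: np.ndarray) -> np.ndarray:
    """Average of outer products g g^T over the sample axis (rows of grads)."""
    return grads.T @ grads / grads.shape[0]


def diag_fim_solution(grads: np.ndarray) -> np.ndarray:
    """Candidate analytic solution: mean of element-wise squared gradients."""
    return np.mean(grads ** 2, axis=0)


def frobenius_gap(v: np.ndarray, fim: np.ndarray) -> float:
    """Squared Frobenius distance between Diag(v) and the FIM."""
    return float(np.linalg.norm(np.diag(v) - fim, ord="fro") ** 2)


# Synthetic data (assumption): 64 vectorized gradients of dimension 5.
rng = np.random.default_rng(0)
grads = rng.normal(size=(64, 5))

fim = empirical_fim(grads)
v_star = diag_fim_solution(grads)
best = frobenius_gap(v_star, fim)

# Positive multiplicative perturbations of v_star should never reduce the gap,
# because the off-diagonal part of the FIM is untouched by a diagonal matrix.
perturbed = [
    frobenius_gap(v_star * np.exp(0.1 * rng.normal(size=5)), fim)
    for _ in range(100)
]
```

Under these assumptions, `v_star` coincides with the diagonal of the empirical FIM, and it is the same second-moment quantity that Adam-style optimizers track with an exponential moving average.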

Figures (6)

  • Figure 1: Evaluation perplexity curves for 1B LLaMA pre-training on C4. "+lm head" indicates that the last layer is trained by full-rank Adam.
  • Figure 2: Additional LLaMA C4 pre-training performance curves. (a), (b) and (c) show the 60M, 130M and 350M models, respectively. "+lm head" indicates that the last layer of LLaMA is trained by full-rank Adam.
  • Figure 3: Throughput of various methods. (a) Absolute throughput, i.e. the number of training tokens processed per second. (b) Effective throughput with Adam as the reference optimizer, i.e. the absolute throughput adjusted by the speed-up factor. The effective throughput of GaLore and Fira is $0$ for some model sizes because they underperform Adam.
  • Figure 4: Memory footprint of various optimizers. We use a token batch size of $256$ under the BF16 format, following the setup of GaLore (Zhao et al., 2024). The suffix "layerwise" denotes the memory consumption with layerwise training enabled, so that only the gradient of the current layer is stored.
  • Figure 5: Pre-training curves verifying the effectiveness of the design choices, for the 130M model.
  • ...and 1 more figure
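The effective-throughput metric in the Figure 3 caption can be sketched as follows. This is a hedged reading, not the paper's stated formula: it assumes "adjusted by" means multiplication, and that the speed-up factor is the ratio of the steps Adam needs to the steps a method needs to reach Adam's final evaluation perplexity, taken as zero when the method never reaches it. The function names and the numbers in the usage lines are hypothetical.

```python
from typing import Optional


def speedup_factor(adam_steps: int, method_steps: Optional[int]) -> float:
    """Assumed definition: steps Adam needs divided by steps the method needs
    to match Adam's final evaluation perplexity. None means it never matched."""
    if method_steps is None:
        return 0.0
    return adam_steps / method_steps


def effective_throughput(
    absolute_tokens_per_sec: float,
    adam_steps: int,
    method_steps: Optional[int],
) -> float:
    """Absolute throughput scaled by the (assumed) speed-up factor."""
    return absolute_tokens_per_sec * speedup_factor(adam_steps, method_steps)


# Hypothetical numbers, for illustration only.
fast_method = effective_throughput(1000.0, adam_steps=100, method_steps=50)
never_matches = effective_throughput(1000.0, adam_steps=100, method_steps=None)
```

With this reading, a method that matches Adam in half the steps doubles its effective throughput, while a method that never matches Adam gets zero, which is consistent with the zero entries the caption reports for GaLore and Fira.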

Theorems & Definitions (33)

  • Proposition 1: diagonal approximation
  • Theorem 3.1: Shampoo pre-conditioner
  • Proposition 2: Normalization and whitening
  • Theorem 3.2: 1-iteration refinement
  • Theorem 3.3: SOAP as 1-iteration alternating optimization of the FIM approximation problem (eq: UFE equation)
  • Proposition 3: Two-sided scaling
  • Proposition 4: Subspace switching
  • Theorem 5.1: Optimal compensation
  • proof
  • proof
  • ...and 23 more