FOAM: Blocked State Folding for Memory-Efficient LLM Training

Ziqing Wen, Jiahuan Wang, Ping Luo, Dongsheng Li, Tao Sun

TL;DR

FOAM tackles the optimizer-state memory bottleneck in LLM training by folding optimizer moments into blocks and adding a residual correction to recover lost information. The method preserves Adam-like convergence while reducing total training memory by about 50% and optimizer-state memory overhead by up to 90%, and it is compatible with other memory-efficient optimizers. Theoretical analysis shows that FOAM retains Adam’s convergence rate under standard non-convex assumptions. Experiments across pre-training and fine-tuning (LLaMA, Qwen, RoBERTa, and long-context and quantization scenarios) show improved memory efficiency, faster convergence, and robust performance.

Abstract

Large language models (LLMs) have demonstrated remarkable performance due to their large parameter counts and extensive training data. However, their scale leads to significant memory bottlenecks during training, especially when using memory-intensive optimizers like Adam. Existing memory-efficient approaches often rely on techniques such as singular value decomposition (SVD), projections, or weight freezing, which can introduce substantial computational overhead, require additional memory for projections, or degrade model performance. In this paper, we propose Folded Optimizer with Approximate Moment (FOAM), a method that compresses optimizer states by computing block-wise gradient means and incorporates a residual correction to recover lost information. Theoretically, FOAM achieves convergence rates equivalent to vanilla Adam under standard non-convex optimization settings. Empirically, FOAM reduces total training memory by approximately 50%, eliminates up to 90% of optimizer state memory overhead, and accelerates convergence. Furthermore, FOAM is compatible with other memory-efficient optimizers, delivering performance and throughput that match or surpass both full-rank and existing memory-efficient baselines.
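
The block-wise mean and residual described in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a flat gradient vector, blocks of 2**level consecutive entries (an inference from the reported 1/8 state retention at fold level 3), and hypothetical helper names `fold`, `unfold`, and `residual`. How FOAM actually applies the residual inside the optimizer update is not specified in this summary.

```python
def fold(grad, level):
    """Compress a flat gradient by averaging consecutive blocks of size 2**level.

    Assumption: blocks are runs of consecutive entries; the paper's block
    layout may differ.
    """
    block = 2 ** level
    if len(grad) % block != 0:
        raise ValueError("gradient length must be divisible by the block size")
    return [sum(grad[i:i + block]) / block for i in range(0, len(grad), block)]


def unfold(folded, level):
    """Expand a folded vector to full size by repeating each block mean."""
    block = 2 ** level
    return [value for value in folded for _ in range(block)]


def residual(grad, level):
    """What folding discards: the gradient minus its block-mean reconstruction."""
    approx = unfold(fold(grad, level), level)
    return [g - a for g, a in zip(grad, approx)]


grad = [1.0, 3.0, 2.0, 6.0, -1.0, 1.0, 0.0, 4.0]
folded = fold(grad, 1)      # 4 stored values instead of 8
lost = residual(grad, 1)    # per-entry deviation from the block mean
```

In this toy setting, the unfolded block means plus the residual reproduce the original gradient exactly, which is the sense in which a residual term can recover information that averaging alone loses.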

Paper Structure

This paper contains 30 sections, 4 theorems, 65 equations, 11 figures, 14 tables, 1 algorithm.

Key Result

Theorem 4.4

Let $\{W_{t}\}_{t\geq 1}$ be generated by Algorithm 1 under the paper's assumptions (from the Lipschitz assumption through the bounded-gradient assumption), with $\eta_{t} = \eta_0 / \sqrt{t}$. Then, we have

Figures (11)

  • Figure 1: Overview of FOAM optimizer with a fold level of $l$.
  • Figure 2: FOAM performance preview on LLM pre-training. Panels (a) and (b): Perplexity learning curves for pre-training LLaMA-350M and LLaMA-1.3B on C4. FOAM achieves lower validation perplexity than the other baselines. Panel (c): Optimizer memory footprint for pre-training LLaMA models. FOAM achieves an approximate 50% reduction in overall training memory consumption, and FOAM-Mini goes further by almost eliminating the memory overhead associated with optimizer states.
  • Figure 3: Additional investigation of FOAM. (a) Impact of the FOAM level $l$: FOAM exhibits strong robustness across varying memory constraints. (b) Extended training of LLaMA-130M on 39B tokens. (c) Integration of FOAM with Adam-Mini and MUON.
  • Figure 4: Validation PPL with or without residual.
  • Figure 5: Cosine Similarities between the Update Matrices of FOAM with or without Residual and Adam. We report the average similarity across all modules within each layer. As observed, the update matrices including the residual term exhibit a higher cosine similarity with Adam’s updates compared to those without the residual. Specifically, for the setting $l=3$, FOAM updates maintain a cosine similarity greater than $0.5$ with standard Adam, despite retaining only $1/8$ of the original Adam optimizer state.
  • ...and 6 more figures
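
The Figure 5 caption states that $l=3$ retains only $1/8$ of the original Adam optimizer state, which suggests a retained fraction of $1/2^{l}$. A minimal arithmetic sketch under that assumption follows. The function names are hypothetical, and the count covers only folded moment entries: it ignores any storage for the residual or other buffers, so it does not reproduce the paper's measured memory figures.

```python
def folded_state_fraction(level):
    """Fraction of Adam's moment entries kept at a given fold level.

    Assumption: each fold level halves the stored state, so level l keeps 1 / 2**l.
    """
    return 1 / (2 ** level)


def folded_state_entries(num_params, level, moments=2):
    """Stored moment entries for num_params parameters.

    Adam keeps two moments per parameter; both are assumed to be folded
    by the same factor 2**level.
    """
    return moments * num_params // (2 ** level)


fraction_at_l3 = folded_state_fraction(3)       # 0.125, matching the 1/8 in the caption
entries_at_l3 = folded_state_entries(1024, 3)   # 256 entries instead of Adam's 2048
```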

Theorems & Definitions (8)

  • Theorem 4.4: Convergence of FOAM
  • Lemma 1.1
  • proof
  • Lemma 1.2
  • proof
  • Lemma 1.3
  • proof
  • proof