4-bit Shampoo for Memory-Efficient Network Training

Sike Wang; Pan Zhou; Jia Li; Hua Huang

TL;DR

Second-order optimizers converge quickly but incur a high memory cost from storing the preconditioner $A$ and its inverse 4th root $A^{-1/4}$. The authors propose 4-bit Shampoo, which quantizes the preconditioner's eigenvector matrix $U$ rather than $A$ itself and applies Björck orthonormalization to the quantized $U$, so that inverse roots can still be computed accurately at much lower memory cost. Theoretical analysis shows that perturbing $U$ yields smaller errors in $A^s$ than perturbing $A$, and experiments show that 4-bit Shampoo matches 32-bit performance on CNNs and vision transformers while using less memory. This makes it possible to train larger models with second-order optimizers under the same hardware budget.
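As a rough illustration of the orthonormalization step, the sketch below applies the first-order Björck iteration, commonly written as $V \leftarrow 1.5\,V - 0.5\,V V^{\mathsf{T}} V$, to a slightly perturbed orthogonal matrix. The Gaussian noise is only a stand-in for quantization error, and the function names, matrix size, and noise level are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def bjorck_orthonormalize(V, iterations=1):
    """First-order Björck iteration: V <- 1.5 V - 0.5 V V^T V."""
    V = np.asarray(V, dtype=np.float64)
    for _ in range(iterations):
        V = 1.5 * V - 0.5 * V @ (V.T @ V)
    return V


def orthogonality_error(V):
    """Frobenius norm of V^T V - I; zero for an exactly orthogonal V."""
    return np.linalg.norm(V.T @ V - np.eye(V.shape[1]))


rng = np.random.default_rng(0)
# Exactly orthogonal matrix from a QR factorization.
U, _ = np.linalg.qr(rng.standard_normal((8, 8)))
# Assumption: small Gaussian noise stands in for 4-bit quantization error.
U_noisy = U + 0.01 * rng.standard_normal((8, 8))
U_fixed = bjorck_orthonormalize(U_noisy, iterations=1)

err_before = orthogonality_error(U_noisy)
err_after = orthogonality_error(U_fixed)
print(f"orthogonality error before: {err_before:.3e}, after: {err_after:.3e}")
```

The iteration only converges when the input is already close to orthogonal, which is the situation expected for a quantized eigenvector matrix; each step roughly squares the orthogonality error.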

Abstract

Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models trained by second-order optimizers. To address this, compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage. However, current approaches only pertain to first-order optimizers. In this paper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, maintaining performance similar to that of 32-bit ones. We show that quantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is remarkably better than quantizing the preconditioner itself both theoretically and experimentally. By rectifying the orthogonality of the quantized eigenvector matrix, we enhance the approximation of the preconditioner's eigenvector matrix, which also benefits the computation of its inverse 4-th root. Besides, we find that linear square quantization slightly outperforms dynamic tree quantization when quantizing second-order optimizer states. Evaluation on various networks for image classification and natural language modeling demonstrates that our 4-bit Shampoo achieves comparable performance to its 32-bit counterpart while being more memory-efficient.
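To make the idea of compressing a 32-bit optimizer state to 4 bits concrete, here is a minimal sketch of block-wise quantization with a "linear square" style codebook. The codebook used here (signed squares of evenly spaced points in $[-1, 1]$), the block size of 64, and all function names are assumptions made for illustration; the paper's exact mapping and packing scheme may differ.

```python
import numpy as np


def linear_square_codebook(bits=4):
    # Assumed codebook: signed squares of evenly spaced points in [-1, 1].
    points = np.linspace(-1.0, 1.0, 2 ** bits)
    return np.sign(points) * points ** 2


def quantize_blockwise(x, codebook, block_size=64):
    """Map each entry to the index of its nearest codebook level, per block."""
    flat = np.asarray(x, dtype=np.float64).ravel()
    pad = (-flat.size) % block_size
    blocks = np.pad(flat, (0, pad)).reshape(-1, block_size)
    # One scale per block: the block's maximum absolute value.
    scales = np.abs(blocks).max(axis=1, keepdims=True)
    scales[scales == 0] = 1.0
    normalized = blocks / scales
    codes = np.abs(normalized[..., None] - codebook).argmin(axis=-1)
    # Codes fit in 4 bits; kept as uint8 here (real code would pack two per byte).
    return codes.astype(np.uint8), scales


def dequantize_blockwise(codes, scales, codebook, shape):
    flat = (codebook[codes] * scales).ravel()
    return flat[: int(np.prod(shape))].reshape(shape)


rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((32, 32)))  # stand-in eigenvector matrix
codebook = linear_square_codebook(bits=4)
codes, scales = quantize_blockwise(U, codebook)
U_hat = dequantize_blockwise(codes, scales, codebook, U.shape)
max_abs_error = np.max(np.abs(U - U_hat))
```

In this sketch the only per-block overhead is one floating-point scale, so storage approaches 4 bits per entry as the block size grows.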

Paper Structure (25 sections, 19 theorems, 60 equations, 10 figures, 13 tables, 6 algorithms)

Introduction
Preliminaries
Shampoo for Matrices
Quantization-based Compression Methods
Methodology
Quantizing the Eigenvector Matrices
Rectifying the Orthogonality of Eigenvector Matrices
Selecting the Quantizer
Overall Algorithm
Theoretical Analysis
Experiments
Related Work
Conclusions, Limitations, and Broader Impact
Implementation Details of Shampoo, CASPR, K-FAC and AdaBK
Randomized SVD Method
...and 10 more sections

Key Result

Lemma 1

Let $\bm{A}$ be a PD matrix whose SVD is $\bm{U}\bm{\Lambda}\bm{U}^{\mathsf{T}}$, where $\bm{U}\!=\![\bm{u}_i]$ is an orthogonal matrix and $\bm{\Lambda}\!=\!{\rm diag}([\lambda_i]^{\mathsf{T}})$ is a diagonal matrix. Given a perturbation $\Delta\bm{U}\!=\![\Delta\bm{u}_i]$ and $s\in\mathbb{R}$, we
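The quantities in the lemma's setup can be computed directly. The sketch below builds a PD matrix $\bm{A} = \bm{U}\bm{\Lambda}\bm{U}^{\mathsf{T}}$, forms $\bm{A}^s = \bm{U}\bm{\Lambda}^s\bm{U}^{\mathsf{T}}$ for $s = -1/4$, and measures how far $(\bm{U}+\Delta\bm{U})\bm{\Lambda}^s(\bm{U}+\Delta\bm{U})^{\mathsf{T}}$ is from $\bm{A}^s$. The eigenvalues, matrix size, and perturbation scale are arbitrary illustrative choices, and the code does not reproduce the lemma's bound.

```python
import numpy as np


def matrix_power_from_eig(U, lam, s):
    """Return U diag(lam**s) U^T for eigenvectors U and eigenvalues lam."""
    return (U * lam ** s) @ U.T


rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 6)))  # orthogonal eigenvector matrix
lam = np.linspace(1.0, 10.0, 6)                   # assumed positive eigenvalues
A = matrix_power_from_eig(U, lam, 1.0)

s = -0.25
A_s = matrix_power_from_eig(U, lam, s)
# Sanity check: (A^{-1/4})^4 A should be the identity.
residual = np.linalg.norm(np.linalg.matrix_power(A_s, 4) @ A - np.eye(6))

# Assumption: a small random perturbation of U, standing in for quantization error.
dU = 1e-3 * rng.standard_normal((6, 6))
A_s_perturbed = matrix_power_from_eig(U + dU, lam, s)
perturbation_error = np.linalg.norm(A_s_perturbed - A_s)
print(f"residual: {residual:.2e}, perturbation error: {perturbation_error:.2e}")
```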

Figures (10)

Figure 1: Visualization of test accuracies and total GPU memory costs of vision transformers. 4-bit Shampoo (naive) quantizes the preconditioner, while 4-bit Shampoo (our) quantizes its eigenvector matrix.
Figure 2: Singular value distributions of PD matrices (real) and their 4-bit compressions (quan) used in Table \ref{tab:quantization-errors-inverse-root} with $\mathcal{R}$=DT, QM=$\bm{A}$. Singular values are shown on a $\log_{10}$ scale.
Figure 3: Elementwise mean errors between $(\bm{V}_{t_2}\bm{\Lambda}^s\bm{V}_{t_2}^{\mathsf{T}})^{-1\!/\!s}(\bm{V}_{t_2}\bm{\Lambda}\bm{V}_{t_2}^{\mathsf{T}})$ and identity matrix $\bm{I}$. Mean errors are shown on a $\log_{10}$ scale.
Figure 4: Visualization of test accuracies on the CIFAR-100 and ImageNet-1k datasets.
Figure 5: Visualization of DT quantization and Linear-2 quantization at $b$-bit ($b=3, 4$) precision.
...and 5 more figures

Theorems & Definitions (24)

Lemma 1
Lemma 2
Proposition 1
Lemma 3
Lemma 4
Lemma 5
Lemma 6
Lemma 7
Lemma 8: von Neumann
Lemma 9
...and 14 more
