Shampoo: Preconditioned Stochastic Tensor Optimization

Vineet Gupta, Tomer Koren, Yoram Singer

TL;DR

Shampoo is a structure-aware preconditioning method for stochastic optimization. It maintains one full preconditioning matrix per tensor dimension, capturing second-order information without the prohibitive cost of a single full-matrix preconditioner. Grounded in online convex optimization and Kronecker-product theory, it comes with convergence guarantees in the convex setting and extends to tensors of arbitrary order. Empirically, Shampoo accelerates convergence on deep learning tasks (image classification and language modeling), with per-step runtime comparable to first-order methods thanks to efficient tensor contractions and partial diagonalization when needed. The result is a practical, architecture-agnostic optimizer with theoretical backing that applies to large-scale models.

Abstract

Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Although it involves a more complex update rule, Shampoo's runtime per step is comparable to that of simple gradient methods such as SGD, AdaGrad, and Adam.
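The matrix (order-2) case gives a concrete picture of "one preconditioner per dimension". The following is a minimal NumPy sketch, not the authors' implementation: it assumes the update accumulates a left statistic from $G G^\top$ and a right statistic from $G^\top G$ and applies their inverse fourth roots on either side of the gradient. The helper names (`inv_root_psd`, `shampoo_matrix_step`), the toy quadratic objective, and the values of `eps` and `lr` are illustrative assumptions.

```python
import numpy as np

def inv_root_psd(mat, p, floor=1e-12):
    """Return mat^(-1/p) for a symmetric positive semidefinite matrix.

    Uses an eigendecomposition; eigenvalues are floored (an assumption made
    here for numerical safety) before taking the negative fractional power.
    """
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.maximum(eigvals, floor)
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T

def shampoo_matrix_step(W, G, L, R, lr):
    """One Shampoo-style step for a matrix parameter W with gradient G."""
    L = L + G @ G.T   # left statistic, shape (m, m): contracts over columns
    R = R + G.T @ G   # right statistic, shape (n, n): contracts over rows
    W = W - lr * inv_root_psd(L, 4) @ G @ inv_root_psd(R, 4)
    return W, L, R

# Toy problem (hypothetical): minimize 0.5 * ||W - target||_F^2.
rng = np.random.default_rng(0)
target = rng.standard_normal((3, 4))

def loss(W):
    return 0.5 * np.sum((W - target) ** 2)

W = np.zeros((3, 4))
eps = 1e-4                      # assumed initial scaling of the statistics
L = eps * np.eye(3)
R = eps * np.eye(4)

initial_loss = loss(W)
for _ in range(50):
    G = W - target              # gradient of the toy objective
    W, L, R = shampoo_matrix_step(W, G, L, R, lr=0.1)
final_loss = loss(W)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

For a 3 x 4 parameter this stores a 3 x 3 and a 4 x 4 matrix, whereas a full-matrix preconditioner over the flattened parameter would be 12 x 12; the gap widens quickly as the dimensions grow.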

Paper Structure

This paper contains 24 sections, 15 theorems, 62 equations, 4 figures, 1 table, 3 algorithms.

Key Result

Lemma 1

For any sequence of matrices $H_1,\ldots,H_T \succ 0$, the regret of online mirror descent is bounded above by,

Figures (4)

  • Figure 1: Illustration of Shampoo for a $3$-dimensional tensor $G\in\mathbb{R}^{3 \times 4 \times 5}$.
  • Figure 2: Training loss for a residual network and an inception network on CIFAR-10.
  • Figure 3: Training loss for a residual network on CIFAR-100 (without batchnorm).
  • Figure 4: Test log-perplexity of the Attention model of Vaswani et al. (2017).
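The tensor shape in Figure 1 can be used to sketch how per-dimension statistics are formed by contraction. This NumPy example is an illustration under assumptions rather than the paper's code: it assumes each mode's statistic is obtained by contracting the gradient with itself over all other axes, and that an order-$k$ tensor uses the inverse $2k$-th root of each statistic. The function names and the `eps` value are hypothetical.

```python
import numpy as np

def mode_gram(G, i):
    """Contract G with itself over every axis except i, giving an (n_i, n_i) matrix."""
    axes = [a for a in range(G.ndim) if a != i]
    return np.tensordot(G, G, axes=(axes, axes))

def inv_root_psd(mat, p, floor=1e-12):
    """Return mat^(-1/p) for a symmetric positive semidefinite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    eigvals = np.maximum(eigvals, floor)   # floor is an assumed safeguard
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T

def precondition(G, stats):
    """Apply stats[i]^(-1/(2k)) along axis i of G, for each of the k axes."""
    k = G.ndim
    out = G
    for i, H in enumerate(stats):
        P = inv_root_psd(H, 2 * k)
        # Mode-i product: contract P's second axis with axis i of the tensor,
        # then move the resulting leading axis back to position i.
        out = np.moveaxis(np.tensordot(P, out, axes=(1, i)), 0, i)
    return out

rng = np.random.default_rng(0)
G = rng.standard_normal((3, 4, 5))         # same shape as in Figure 1
eps = 1e-4                                  # assumed initial scaling

# One statistic per dimension: shapes (3, 3), (4, 4), (5, 5).
stats = [eps * np.eye(n) + mode_gram(G, i) for i, n in enumerate(G.shape)]
stat_shapes = [H.shape for H in stats]

preconditioned = precondition(G, stats)
print(stat_shapes, preconditioned.shape)
```

The three statistics hold 9 + 16 + 25 = 50 entries in total, compared with 60 x 60 = 3600 for a full preconditioner over the flattened 60-element gradient.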

Theorems & Definitions (26)

  • Lemma 1
  • Lemma 2 (Gupta et al., 2017)
  • Lemma 3
  • Lemma 4
  • Lemma 5 (Ando et al., 2004)
  • Lemma 6
  • Theorem 7
  • Lemma 8
  • Lemma 9
  • proof
  • ...and 16 more