Accelerating Newton-Schulz Iteration for Orthogonalization via Chebyshev-type Polynomials

Ekaterina Grishina, Matvey Smirnov, Maxim Rakhuba

TL;DR

A Chebyshev-optimized version of Newton-Schulz (CANS) is proposed. Based on Chebyshev's alternance theorem, the authors theoretically derive optimal coefficients for the third-order Newton-Schulz iteration and apply a Remez algorithm to compute optimal higher-degree polynomials.

Abstract

The problem of computing the optimal orthogonal approximation to a given matrix has attracted growing interest in machine learning. Notable applications include the recent Muon optimizer and Riemannian optimization on the Stiefel manifold. Among existing approaches, the Newton-Schulz iteration has emerged as a particularly effective solution, as it relies solely on matrix multiplications and thus achieves high computational efficiency on GPU hardware. Despite its efficiency, the method has an inherent limitation: its coefficients are fixed and thus not optimized for a given matrix. In this paper we address this issue by proposing a Chebyshev-optimized version of Newton-Schulz (CANS). Based on Chebyshev's alternance theorem, we theoretically derive optimal coefficients for the third-order Newton-Schulz iteration and apply a Remez algorithm to compute optimal higher-degree polynomials. We leverage these polynomials to construct controlled approximate orthogonalization schemes, which are of interest in deep learning applications. Practically, we demonstrate the method's effectiveness in two key applications: orthogonalization in the Muon optimizer, and an efficient retraction alternative for Riemannian optimization on the Stiefel manifold.
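
For orientation, the fixed-coefficient baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the classical cubic Newton-Schulz iteration $X_{k+1} = \tfrac{3}{2} X_k - \tfrac{1}{2} X_k X_k^T X_k$, not the CANS method itself; the function name `newton_schulz_orthogonalize`, the iteration count, and the synthetic test matrix are assumptions made for this example only.

```python
import numpy as np


def newton_schulz_orthogonalize(a: np.ndarray, num_iters: int = 30) -> np.ndarray:
    """Approximate the orthogonal polar factor of `a` (hypothetical helper).

    Uses the classical cubic Newton-Schulz iteration with the fixed
    coefficients (3/2, -1/2); only matrix multiplications are required.
    """
    # Scale so that all singular values lie in (0, 1]: the Frobenius norm
    # is an upper bound on the largest singular value.
    x = a / np.linalg.norm(a)
    for _ in range(num_iters):
        # Each step applies p(s) = 1.5 s - 0.5 s^3 to every singular value,
        # pushing it toward 1 while leaving the singular vectors unchanged.
        x = 1.5 * x - 0.5 * x @ (x.T @ x)
    return x


# Synthetic 6x4 test matrix with known singular values (illustrative choice).
rng = np.random.default_rng(0)
u, _ = np.linalg.qr(rng.standard_normal((6, 4)))  # orthonormal columns
v, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # orthogonal matrix
s = np.array([1.0, 0.5, 0.1, 0.01])
a = u @ np.diag(s) @ v.T

q = newton_schulz_orthogonalize(a)

# Deviation of the columns from orthonormality.
orth_error = np.linalg.norm(q.T @ q - np.eye(4))
# Distance to the exact polar factor U V^T of the test matrix.
polar_error = np.linalg.norm(q - u @ v.T)
print(orth_error, polar_error)
```

Because the coefficients are fixed, a small singular value grows by a factor of only about 1.5 per step until it approaches 1. This slow initial phase is what motivates choosing the polynomial coefficients for a given spectral interval.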

Paper Structure

This paper contains 27 sections, 7 theorems, 42 equations, 9 figures, 4 tables, 1 algorithm.

Key Result

Theorem 3.1

Let $0 < a < b$, $n \in \mathbb N$, and $f \in C[a,b]$. Then the following statements hold.

Figures (9)

  • Figure 1: Illustration of the selection of a degree-3 ($d=2$) polynomial with a large derivative at zero. The green polynomial falls into $[1-\delta, 1+\delta]$, but has insufficient derivative. The blue polynomial $q_{d, \delta}$ has the highest possible derivative among polynomials from $\mathcal{P}_{d, \delta}$. The purple polynomial is not part of $\mathcal{P}_{d, \delta}$, and its derivative is too large.
  • Figure 2: Convergence of iterative algorithms for matrix orthogonalization. The solid lines show the performance when the exact values of $\sigma_1(A), \sigma_n(A)$ are known, and the matrix is normalized by $\sigma_1(A)$. In other cases, the matrix is normalized by $\|(A^TA)^2\|^{1/4}_F$ and the precise value of the left boundary is $\sigma_n(A)/\|(A^TA)^2\|^{1/4}_F = 9 \times 10^{-5}$. The striped lines show performance for an overestimated boundary $a_0 = 10^{-3}$, the dotted lines for an underestimated $a_0 = 10^{-7}$. The dash-dotted lines show convergence of the algorithm with 4 iterations of $\delta$-orthogonalization (Algorithm 1).
  • Figure 3: Comparison of CANS with the original Muon polynomial. The zoomed plot shows behavior near zero. "iter" denotes the number of polynomials in the composition, "mm" the total number of matmuls.
  • Figure 4: Comparison of CANS polynomials with jiacheng.
  • Figure 5: Test loss of NanoGPT trained using Muon optimizer with different polynomials.
  • ...and 4 more figures

Theorems & Definitions (22)

  • Theorem 3.1
  • proof
  • Proposition 3.2
  • proof
  • Proposition 3.3
  • proof
  • Proposition 3.4
  • proof
  • Corollary 3.5
  • proof
  • ...and 12 more