Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal?

Weijie Su

TL;DR

Problem: understanding when gradient orthogonalization used by matrix-gradient optimizers is optimal. Approach: isotropic curvature model that reduces a one-step update to the convex program $\min_Q \, -\operatorname{Tr}(Q G^\top) + \mathbb{E}_{\zeta \sim \text{sphere}} H(\|Q\zeta\|)$ driven by a gradient $G$ at $W$ and a curvature function $H$. Findings: under super-quadratic growth of the high-order term, the global solution $Q^\star$ preserves the singular spaces of $G$ and yields spectrum homogenization; under a kink in curvature the optimal update is proportional to the gradient's sign, i.e., orthogonalization; Muon is consistent with the model but not strictly optimal in general. Impact: provides a theoretical foundation for matrix-gradient methods and suggests solving the convex update program with an approximate curvature $H$ to design curvature-aware optimizers for large language models.
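
As a rough illustration of the objects named in the TL;DR, the sketch below computes the orthogonalized gradient (the matrix sign of $G$, via an SVD) and a Monte Carlo estimate of the model objective. The function names, the matrix sizes, and the cubic curvature `H` are assumptions made for this example only, and "sphere" is read here as the uniform distribution on the unit sphere; none of this is taken from the paper's code.

```python
import numpy as np

def orthogonalize(G):
    """Matrix sign of G: set every singular value to 1 (assumes G has full rank)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def model_objective(Q, G, H, n_samples=20000, seed=0):
    """Monte Carlo estimate of -Tr(Q G^T) + E_zeta H(||Q zeta||)."""
    rng = np.random.default_rng(seed)
    zeta = rng.standard_normal((n_samples, Q.shape[1]))
    # Normalized Gaussian vectors are uniform on the unit sphere.
    zeta /= np.linalg.norm(zeta, axis=1, keepdims=True)
    norms = np.linalg.norm(zeta @ Q.T, axis=1)
    return -np.trace(Q @ G.T) + H(norms).mean()

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 6))          # stand-in gradient matrix
O = orthogonalize(G)
H = lambda r: 0.5 * r**3                 # hypothetical super-quadratic curvature

# Compare a gradient-direction step with an orthogonalized step of the same scale.
print(model_objective(0.1 * G, G, H), model_objective(0.1 * O, G, H))
```

The comparison printed at the end depends on the step scale and on the assumed `H`, so it illustrates how candidate updates can be scored under the model rather than which one wins in general.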

Abstract

In this paper, we introduce a model for analyzing deep learning optimization over a single iteration by leveraging the matrix structure of the weights. We derive the model by assuming isotropy of curvature, including the second-order Hessian and higher-order terms, of the loss function across all perturbation directions; hence, we call it the isotropic curvature model. This model is a convex optimization program amenable to analysis, which allows us to understand how an update on the weights in the form of a matrix relates to the change in the total loss function. As an application, we use the isotropic curvature model to analyze the recently introduced Muon optimizer and other matrix-gradient methods for training language models. First, we show that under a general growth condition on the curvature, the optimal update matrix is obtained by making the spectrum of the original gradient matrix more homogeneous -- that is, making its singular values closer in ratio -- which in particular improves the conditioning of the update matrix. Next, we show that the orthogonalized gradient becomes optimal for the isotropic curvature model when the curvature exhibits a phase transition in growth. Taken together, these results suggest that the gradient orthogonalization employed in Muon and other related methods is directionally correct but may not be strictly optimal. Finally, we discuss future research on how to leverage the isotropic curvature model for designing new optimization methods for training deep learning and language models.
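
To make "more homogeneous spectrum" concrete, here is a minimal sketch using a hypothetical one-parameter family of updates $U \operatorname{diag}(\sigma^p) V^\top$. This family is an assumption chosen for illustration, not the optimal update derived in the paper: $p = 1$ returns the gradient itself, $p = 0$ returns the orthogonalized gradient, and intermediate $p$ pulls the singular values closer in ratio.

```python
import numpy as np

def power_spectrum_update(G, p):
    """Hypothetical update U diag(s**p) V^T built from the thin SVD of G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return (U * s**p) @ Vt   # scales column j of U by s[j]**p

def condition_number(M):
    """Ratio of the largest to the smallest singular value of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return s.max() / s.min()

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 8))          # stand-in gradient matrix
for p in (1.0, 0.5, 0.0):
    print(p, condition_number(power_spectrum_update(G, p)))
```

For this family the condition number of the update is the condition number of $G$ raised to the power $p$, so it decreases toward 1 as $p$ goes to 0.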

Paper Structure

This paper contains 12 sections, 6 theorems, 39 equations, 1 figure.

Key Result

Proposition 3.1

There exists a global solution $Q^\star$ to the optimization program of the isotropic curvature model such that the singular spaces of $Q^\star$ and $G$ are aligned.
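
A short worked reduction shows what alignment buys. It rests on assumptions not spelled out in this listing: $G \in \mathbb{R}^{m \times n}$ has the thin SVD $G = U \operatorname{diag}(\sigma) V^\top$, the program is $\min_Q -\operatorname{Tr}(Q G^\top) + \mathbb{E}_\zeta H(\|Q\zeta\|)$ with $\zeta$ uniform on the unit sphere of $\mathbb{R}^n$, and "aligned" is read as $Q = U \operatorname{diag}(q) V^\top$ for some vector $q$.

```latex
\begin{align*}
\operatorname{Tr}(Q G^\top)
  &= \operatorname{Tr}\bigl(U \operatorname{diag}(q) V^\top V \operatorname{diag}(\sigma) U^\top\bigr)
   = \sum_i q_i \sigma_i, \\
\|Q\zeta\|^2
  &= \sum_i q_i^2 \,(v_i^\top \zeta)^2, \\
-\operatorname{Tr}(Q G^\top) + \mathbb{E}_\zeta H(\|Q\zeta\|)
  &= -\sum_i q_i \sigma_i
     + \mathbb{E}_\zeta H\Bigl(\bigl(\textstyle\sum_i q_i^2 \zeta_i^2\bigr)^{1/2}\Bigr).
\end{align*}
```

The last line uses rotational invariance of $\zeta$: because the columns $v_i$ of $V$ are orthonormal, the coordinates $v_i^\top \zeta$ have the same joint distribution as the leading coordinates $\zeta_i$. Under these assumptions the matrix program restricted to aligned $Q$ becomes a problem in the vector $q$ alone.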

Figures (1)

  • Figure 1: Numerical approximation of $H(r)/r^2$ on GPT-2 (small) (Radford et al., 2019), for fully connected layers (left) and attention layers (right). The values in parentheses in the legend indicate the estimated exponent $2 + \alpha$ of $H(r)$ starting from $r = 10^{0.5}$. For example, $2.39$ and $2.50$ in the first line are the exponents for the left and right panels, respectively. We sample $100$ random directions $\delta W$ from a standard normal distribution, sample the first $300$ examples from the C4-newslike validation split (Raffel et al., 2020), and truncate each example to $100$ tokens. For each radius $r$, we construct perturbations $\delta W u_i$, where $u_i$ is the input to the layer, and renormalize each $\delta W u_i$ to have norm $r$. We then compute $H$ by subtracting $L(Wu_i) + \langle \nabla L(Wu_i), \delta W u_i \rangle$ from $L(Wu_i + \delta W u_i)$. This yields a total of $100 \times 300 \times (100-1) = 2{,}970{,}000$ remainder terms for approximating $H(r)/r^2$ per radius $r$. Note that the growth of $H$ will eventually plateau for very large $r$.
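
The remainder computation described in the caption can be sketched on a toy problem. The example below replaces GPT-2 with a single random linear layer followed by a softmax cross-entropy loss, so the layer sizes, the loss, and all function names are assumptions for illustration; only the recipe (random $\delta W$, perturbation $\delta W u_i$ rescaled to norm $r$, first-order Taylor remainder) follows the caption.

```python
import numpy as np

def loss(z, y):
    """Softmax cross-entropy of logits z against integer label y."""
    z = z - z.max()                       # shift for numerical stability
    return np.log(np.exp(z).sum()) - z[y]

def grad(z, y):
    """Gradient of the loss with respect to the logits: softmax(z) - onehot(y)."""
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0
    return p

def remainder_ratio(r, n_dirs=100, n_inputs=50, d_in=16, d_out=10, seed=0):
    """Average Taylor remainder H at radius r, divided by r**2 (toy setting)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
    dWs = rng.standard_normal((n_dirs, d_out, d_in))   # shared random directions
    vals = []
    for _ in range(n_inputs):
        u = rng.standard_normal(d_in)                  # stand-in layer input u_i
        y = rng.integers(d_out)
        z = W @ u
        base, g = loss(z, y), grad(z, y)
        for dW in dWs:
            delta = dW @ u
            delta *= r / np.linalg.norm(delta)         # renormalize to norm r
            vals.append(loss(z + delta, y) - base - g @ delta)
    return np.mean(vals) / r**2

for r in (0.01, 1.0, 100.0):
    print(r, remainder_ratio(r))
```

In this toy setting the loss grows at most linearly in the logits, so the ratio falls off at large $r$; the exponents reported in the figure come from the actual network and are not reproduced by this sketch.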

Theorems & Definitions (16)

  • Proposition 3.1
  • Theorem 1
  • Remark 3.1
  • Theorem 2
  • Proposition 3.4
  • Lemma 4.1: von Neumann's trace inequality
  • Proof of Proposition \ref{prop:orth_spaces}
  • Remark 4.1
  • Proof of Theorem \ref{thm:homo}
  • Remark 4.2
  • ...and 6 more