Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal?

Weijie Su

TL;DR

Problem: understanding when gradient orthogonalization used by matrix-gradient optimizers is optimal. Approach: isotropic curvature model that reduces a one-step update to the convex program $\min_Q \; -\operatorname{Tr}(Q G^\top) + \mathbb{E}_{\zeta \sim \text{sphere}} \, H(\|Q\zeta\|)$ driven by a gradient $G$ at $W$ and a curvature function $H$. Findings: under super-quadratic growth of the high-order term, the global solution $Q^*$ preserves the singular spaces of $G$ and yields spectrum homogenization; under a kink in curvature the optimal update is proportional to the gradient's sign, i.e., orthogonalization; Muon is consistent with the model but not strictly optimal in general. Impact: provides a theoretical foundation for matrix-gradient methods and suggests solving the convex update program with an approximate curvature $H$ to design curvature-aware optimizers for large language models.

Abstract

In this paper, we introduce a model for analyzing deep learning optimization over a single iteration by leveraging the matrix structure of the weights. We derive the model by assuming isotropy of curvature, including the second-order Hessian and higher-order terms, of the loss function across all perturbation directions; hence, we call it the isotropic curvature model. This model is a convex optimization program amenable to analysis, which allows us to understand how an update on the weights in the form of a matrix relates to the change in the total loss function. As an application, we use the isotropic curvature model to analyze the recently introduced Muon optimizer and other matrix-gradient methods for training language models. First, we show that under a general growth condition on the curvature, the optimal update matrix is obtained by making the spectrum of the original gradient matrix more homogeneous -- that is, making its singular values closer in ratio -- which in particular improves the conditioning of the update matrix. Next, we show that the orthogonalized gradient becomes optimal for the isotropic curvature model when the curvature exhibits a phase transition in growth. Taken together, these results suggest that the gradient orthogonalization employed in Muon and other related methods is directionally correct but may not be strictly optimal. Finally, we discuss future research on how to leverage the isotropic curvature model for designing new optimization methods for training deep learning and language models.
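To make the objects above concrete, here is a minimal NumPy sketch, not the paper's implementation. It orthogonalizes a gradient matrix through its SVD (replacing the singular values with ones) and estimates the objective $-\operatorname{Tr}(Q G^\top) + \mathbb{E}_{\zeta} H(\|Q\zeta\|)$ by Monte Carlo. The function names `orthogonalize` and `objective`, the matrix shape, the sample count, and the quadratic choice $H(t) = t^2/2$ are all assumptions made for illustration; $\zeta$ is taken to be uniform on the unit sphere in the column dimension of $Q$.

```python
import numpy as np

def orthogonalize(G):
    """Return U @ Vt from the thin SVD of G, i.e. G with its singular values set to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def objective(Q, G, H, num_samples=20000, seed=0):
    """Monte Carlo estimate of -Tr(Q G^T) + E_zeta H(||Q zeta||).

    zeta is drawn uniformly from the unit sphere in R^n, where Q is m x n
    (an assumption about the sampling space, chosen so that Q @ zeta is defined).
    """
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((num_samples, Q.shape[1]))
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # normalize rows onto the sphere
    norms = np.linalg.norm(Z @ Q.T, axis=1)         # ||Q zeta|| for each sample
    return -np.trace(Q @ G.T) + H(norms).mean()

# Hypothetical curvature function, used only to exercise the estimator.
H_quadratic = lambda t: 0.5 * t**2

rng = np.random.default_rng(1)
G = rng.standard_normal((4, 6))          # stand-in for a gradient matrix
Q_orth = orthogonalize(G)

singular_values = np.linalg.svd(Q_orth, compute_uv=False)
value_orth = objective(Q_orth, G, H_quadratic)
value_grad = objective(G, G, H_quadratic)

print("singular values of orthogonalized update:", singular_values)
print("objective at orthogonalized update:", value_orth)
print("objective at raw gradient:", value_grad)
```

With the purely quadratic $H$ assumed here, the expectation reduces to a scaled Frobenius penalty on $Q$, so the minimizer is a multiple of $G$ itself; the growth conditions discussed in the abstract are what move the optimum toward a more homogeneous spectrum. Swapping in a different `H` is a one-line change.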
