
Muon Converges under Heavy-Tailed Noise: Nonconvex Hölder-Smooth Empirical Risk Minimization

Hideaki Iiduka

Abstract

Muon is a recently proposed optimizer that enforces orthogonality in parameter updates by projecting gradients onto the Stiefel manifold, leading to stable and efficient training of large-scale deep neural networks. Meanwhile, previously reported results indicate that stochastic noise in practical machine learning may exhibit heavy-tailed behavior, violating the bounded-variance assumption. In this paper, we consider the problem of minimizing a nonconvex Hölder-smooth empirical risk, a setting well suited to heavy-tailed stochastic noise. We then show that Muon converges to a stationary point of the empirical risk under a boundedness condition that accounts for heavy-tailed stochastic noise. In addition, we show that Muon converges faster than mini-batch SGD.
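
To make the orthogonalized update mentioned in the abstract concrete, the following is a minimal sketch of one Muon-style step. It is not taken from the paper. It assumes the orthogonalization is the exact polar factor $\bm{U}\bm{V}^\top$ of the momentum matrix, computed here by SVD (practical Muon implementations typically approximate this factor with Newton-Schulz iterations), and it assumes an exponential-moving-average momentum. The names `orthogonalize` and `muon_step` and the values of `lr` and `beta` are hypothetical choices made for illustration.

```python
import numpy as np


def orthogonalize(M):
    """Return the orthogonal polar factor U @ Vt of M, where M = U diag(S) Vt.

    Assumption: exact SVD stands in for the Newton-Schulz approximation
    that practical Muon implementations usually apply.
    """
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt


def muon_step(W, M, grad, lr=0.02, beta=0.9):
    """One illustrative Muon-style update (hypothetical helper).

    W    : weight matrix
    M    : momentum buffer with the same shape as W
    grad : mini-batch stochastic gradient at W
    """
    # Assumed momentum form: exponential moving average of gradients.
    M_new = beta * M + (1.0 - beta) * grad
    # Replace the momentum by its nearest semi-orthogonal matrix.
    O = orthogonalize(M_new)
    # Step along the orthogonalized direction.
    W_new = W - lr * O
    return W_new, M_new


# Toy usage with random data (illustrative only, not a result from the paper).
rng = np.random.default_rng(0)
W = rng.standard_normal((2, 3))
M = np.zeros_like(W)
grad = rng.standard_normal((2, 3))

W_next, M_next = muon_step(W, M, grad)
O = orthogonalize(M_next)
```

Because `O` has orthonormal rows whenever the momentum has full row rank, the size of the step does not depend on the magnitude of the gradient. This gives some intuition for why an update of this kind can remain well behaved when the gradient noise is heavy-tailed, although the paper's analysis should be consulted for the actual conditions.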

Paper Structure

This paper contains 32 sections, 13 theorems, 96 equations.

Key Result

Proposition 2.1

Suppose that Assumption \ref{assum:1} holds and let $\nabla f_{\bm{\xi}} (\bm{W})$ be defined by \eqref{mini_batch}. Then, the following hold.

Theorems & Definitions (14)

  • Example 2.1: Example satisfying Assumption \ref{assum:1}(A2)
  • Proposition 2.1
  • Lemma 3.1
  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Lemma 4.1
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • ...and 4 more