DeMuon: A Decentralized Muon for Matrix Optimization over Graphs

Chuan He, Shuyi Ren, Jingwei Mao, Erik G. Larsson

TL;DR

This work addresses decentralized matrix optimization over graphs under heavy-tailed gradient noise by extending Muon to the decentralized setting as DeMuon. The method combines matrix orthogonalization through Newton-Schulz-type updates with gradient tracking to mitigate heterogeneity among local functions, and it reaches an $\epsilon$-nuclear-norm stochastic stationary point with an iteration complexity that matches centralized bounds in its dependence on $\epsilon$. The authors prove consensus and convergence guarantees and report preliminary gains in decentralized transformer pretraining across several network topologies. The results suggest DeMuon as a robust, communication-efficient approach for optimizing large-scale matrix-valued variables in distributed environments.

Abstract

In this paper, we propose DeMuon, a method for decentralized matrix optimization over a given communication topology. DeMuon incorporates matrix orthogonalization via Newton-Schulz iterations, a technique inherited from its centralized predecessor Muon, and employs gradient tracking to mitigate heterogeneity among local functions. Under heavy-tailed noise conditions and additional mild assumptions, we establish the iteration complexity of DeMuon for reaching an approximate stochastic stationary point. This complexity result matches the best-known complexity bounds of centralized algorithms in terms of dependence on the target tolerance. To the best of our knowledge, DeMuon is the first direct extension of Muon to decentralized optimization over graphs with provable complexity guarantees. We conduct preliminary numerical experiments on decentralized transformer pretraining over graphs with varying degrees of connectivity. Our numerical results demonstrate a clear margin of improvement of DeMuon over other popular decentralized algorithms across different network topologies.
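To make the orthogonalization step concrete, here is a minimal sketch of a Newton-Schulz iteration in plain Python. It assumes the classical cubic update $X \leftarrow 1.5X - 0.5XX^\top X$ applied to a Frobenius-normalized matrix. The exact polynomial and iteration count used by DeMuon are not given in this summary, so the function name `newton_schulz_orthogonalize`, the step count, and the test matrix `G` are illustrative assumptions. For a full-rank input, the iterate approaches the polar factor $UV^\top$ of $G = U\Sigma V^\top$, and the inner product $\langle G, UV^\top\rangle$ equals the nuclear norm of $G$, which is the quantity behind the nuclear-norm stationarity measure.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def newton_schulz_orthogonalize(g, steps=20):
    """Approximate the polar factor U V^T of g = U S V^T.

    Assumption: classical cubic Newton-Schulz update, not necessarily
    the polynomial used in the paper.
    """
    fro = math.sqrt(sum(v * v for row in g for v in row))
    if fro == 0.0:
        return [row[:] for row in g]
    # After Frobenius normalization, all singular values lie in (0, 1].
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(x, matmul(transpose(x), x))
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x

# Illustrative full-rank 3x2 matrix (hypothetical data).
G = [[3.0, 1.0], [1.0, 2.0], [0.0, 1.0]]
O = newton_schulz_orthogonalize(G)

# Columns of O should be orthonormal, so O^T O is close to the identity.
gram = matmul(transpose(O), O)

# <G, O> should be close to the nuclear norm (sum of singular values) of G.
inner = sum(gv * ov for g_row, o_row in zip(G, O) for gv, ov in zip(g_row, o_row))
```

The iteration uses only matrix products, which is why this family of updates is attractive on accelerators compared with an explicit singular value decomposition.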

Paper Structure

This paper contains 7 sections, 8 theorems, 49 equations, 2 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

Suppose that Assumption \ref{asp:basic} holds. Let $\{X_i^k\}$ be generated by Algorithm \ref{alg:r-msgn-1} with step size $\eta>0$, and let $\lambda$ be given in \eqref{upbd:eig-W}. Then, it holds that $\|X_{[N]}^k - \mathbf{1}_N\otimes\overline{X}^k\| \le \sqrt{N}\lambda\eta/(1-\lambda)$ for all $k\ge0$.
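The shape of this bound can be checked numerically on a toy problem. The sketch below is not the paper's algorithm: it assumes scalar variables per node, a symmetric ring mixing matrix with weight 1/3 on each node and its two neighbors, identical initialization, and a mix-after-step update $x^{k+1} = W(x^k - \eta u^k)$ with $|u_i^k| \le 1$ standing in for the normalized update directions. Under these assumptions, the consensus error obeys $e^{k+1} \le \lambda(e^k + \eta\sqrt{N})$, and summing the geometric series gives $\sqrt{N}\lambda\eta/(1-\lambda)$. All names and parameter values here are hypothetical.

```python
import math
import random

N, eta, steps = 8, 0.1, 200  # illustrative values, not from the paper

def mix(x):
    """Apply the ring mixing matrix: average each node with its two neighbors."""
    return [(x[i - 1] + x[i] + x[(i + 1) % N]) / 3.0 for i in range(N)]

# This mixing matrix is circulant with eigenvalues (1 + 2 cos(2*pi*j/N)) / 3.
# lam is the largest magnitude over j = 1..N-1, i.e. the norm of W - (1/N) 1 1^T.
lam = max(abs(1 + 2 * math.cos(2 * math.pi * j / N)) / 3.0 for j in range(1, N))
bound = math.sqrt(N) * lam * eta / (1 - lam)

rng = random.Random(0)
x = [0.0] * N  # identical initialization, so the consensus error starts at zero
worst = 0.0
for _ in range(steps):
    # Stand-in for normalized update directions with magnitude at most 1.
    u = [rng.choice([-1.0, 1.0]) for _ in range(N)]
    x = mix([xi - eta * ui for xi, ui in zip(x, u)])
    mean = sum(x) / N
    err = math.sqrt(sum((xi - mean) ** 2 for xi in x))
    worst = max(worst, err)

print(f"lambda = {lam:.4f}, worst consensus error = {worst:.4f}, bound = {bound:.4f}")
```

The bound does not depend on the iteration index because the update directions are uniformly bounded, so the error never accumulates beyond the geometric sum.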

Figures (2)

  • Figure 1: Training losses in decentralized training of Transformer models over complete, directed exponential, and ring graphs.
  • Figure 2: Validation losses in decentralized training of Transformer models over complete, directed exponential, and ring graphs.

Theorems & Definitions (18)

  • Remark 1
  • Lemma 1: consensus error
  • Remark 2
  • Theorem 1: iteration complexity
  • Remark 3
  • Lemma 2
  • Lemma 3
  • proof
  • proof: Proof of Lemma \ref{lem:cs-error}
  • Lemma 4
  • ...and 8 more