LoRA meets Riemannion: Muon Optimizer for Parametrization-independent Low-Rank Adapters

Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Soboleva, Aibek Alanov, Maxim Rakhuba

TL;DR

This work tackles the reparameterization sensitivity of LoRA by formulating LoRA optimization on the fixed-rank manifold $\mathcal{M}_r$ and introducing a Muon-inspired optimizer on that manifold, named Riemannion. It combines Riemannian gradient-based optimization, Locally Optimal Initialization (LOI) to place the initial point advantageously on $\mathcal{M}_r$, and an efficient, autodiff-friendly implementation (including OrthoLR/ProjectLR) to keep overhead minimal. Key contributions include the generalization of Muon to $\mathcal{M}_r$, a principled LOI with a closed-form-like initialization (e.g., $\Delta W^{(0)}_* = \alpha U_{1,r} V_{r,2r}^\top$ under suitable choices), and a single-backward-pass gradient trick enabling scalable computation. Empirically, Riemannion delivers faster convergence and improved final task performance over standard LoRA and recent geometrically aware methods on both large language models and diffusion-based generation tasks, with reduced variance and competitive overhead.

Abstract

This work presents a novel, fully Riemannian framework for Low-Rank Adaptation (LoRA) that geometrically treats low-rank adapters by optimizing them directly on the fixed-rank manifold. This formulation eliminates the parametrization ambiguity present in standard Euclidean optimizers. Our framework integrates three key components to achieve this: (1) we derive Riemannion, a new Riemannian optimizer on the fixed-rank matrix manifold that generalizes the recently proposed Muon optimizer; (2) we develop a Riemannian gradient-informed LoRA initialization, and (3) we provide an efficient implementation without prominent overhead that uses automatic differentiation to compute arising geometric operations while adhering to best practices in numerical linear algebra. Comprehensive experimental results on both LLM and diffusion model architectures demonstrate that our approach yields consistent and noticeable improvements in convergence speed and final task performance over both standard LoRA and its state-of-the-art modifications.
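
The parametrization ambiguity mentioned in the abstract can be illustrated with a small numerical sketch. It is not taken from the paper: the toy quadratic loss, the matrix sizes, the scaling factor `c`, and the helper `sgd_step` are all assumptions chosen for illustration. Two factor pairs that represent the same adapter $\Delta W = A B^\top$ are each updated with one plain gradient step on the factors, and the resulting adapters no longer coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2                      # hypothetical layer sizes and adapter rank
A = rng.standard_normal((m, r))
B = rng.standard_normal((n, r))
target = rng.standard_normal((m, n))   # toy target defining the loss below


def sgd_step(A, B, lr=0.01):
    """One Euclidean gradient step on the factors for the toy loss
    0.5 * ||A @ B.T - target||_F^2 (an assumed loss, for illustration only)."""
    G = A @ B.T - target               # gradient with respect to delta_W = A @ B.T
    return A - lr * G @ B, B - lr * G.T @ A


# Rescale the factors: (A * c) @ (B / c).T equals A @ B.T for any c != 0.
c = 3.0
A_s, B_s = A * c, B / c
same_start = np.allclose(A @ B.T, A_s @ B_s.T)

# Take one factor-space gradient step from each parametrization.
A1, B1 = sgd_step(A, B)
A1_s, B1_s = sgd_step(A_s, B_s)
same_after_step = np.allclose(A1 @ B1.T, A1_s @ B1_s.T)
gap = np.linalg.norm(A1 @ B1.T - A1_s @ B1_s.T)

print("same adapter before the step:", same_start)
print("same adapter after the step: ", same_after_step)
print("Frobenius gap after the step:", gap)
```

Both parametrizations start from the same point on the fixed-rank manifold, yet the factor-space update depends on how the adapter happens to be factored. An optimizer defined directly on the manifold avoids this dependence by construction.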

Paper Structure

This paper contains 32 sections, 4 theorems, 60 equations, 4 figures, 7 tables, 6 algorithms.

Key Result

Theorem 5.1

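Let the SVD of $\nabla_W \mathcal{L}(W)$ be $\nabla_W \mathcal{L}(W) = U \Sigma V^\top$, and let $\sigma_{2r} \neq \sigma_{2r+1}$. Then any optimal solution $\Delta W^{(0)}_{*}$ to the initialization problem has the form stated in Theorem 5.1 of the paper; the TL;DR gives the example $\Delta W^{(0)}_{*} = \alpha U_{1,r} V_{r,2r}^\top$ under suitable choices.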
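
As a rough illustration of the example form quoted in the TL;DR, $\Delta W^{(0)}_* = \alpha U_{1,r} V_{r,2r}^\top$, the sketch below builds such a matrix from the SVD of a stand-in gradient. The index convention is an assumption: $U_{1,r}$ is read as the first $r$ left singular vectors and $V_{r,2r}$ as right singular vectors $r+1$ through $2r$. The function name `loi_sketch`, the random gradient, and the values of `r` and `alpha` are hypothetical, and this is not the paper's implementation.

```python
import numpy as np


def loi_sketch(grad, r, alpha=1.0):
    """Build alpha * U[:, :r] @ V[:, r:2r].T from the SVD of `grad`.

    The column ranges are an assumed reading of the U_{1,r}, V_{r,2r} notation.
    """
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    U_block = U[:, :r]            # assumed meaning of U_{1,r}
    V_block = Vt[r:2 * r, :].T    # assumed meaning of V_{r,2r}
    return alpha * U_block @ V_block.T


rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 6))   # stand-in for the loss gradient, not real data
r, alpha = 2, 0.5

delta_w0 = loi_sketch(grad, r, alpha)
rank = np.linalg.matrix_rank(delta_w0, tol=1e-10)
singular_values = np.linalg.svd(delta_w0, compute_uv=False)

print("shape:", delta_w0.shape)
print("rank:", rank)
print("leading singular values:", singular_values[:r])
```

Under this reading the initial adapter has exactly $r$ nonzero singular values, all equal to `alpha`, so it lies on the rank-$r$ manifold. The construction needs $2r \le \min(m, n)$ so that both blocks of singular vectors exist.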

Figures (4)

  • Figure 1: Comparison of text and image similarities for LoRA and our method with rank 4 at different learning rates at step 400.
  • Figure 2: Visual results for subject-driven generation at training step 600.
  • Figure 3: Additional visual comparison of our method and LoRA at checkpoint $600$.
  • Figure 4: Relative Time Cost $\left( \text{calculated as } \left(T_\text{Riemannion} - T_\text{Adam}\right) / {T_\text{Adam}} \right)$ of Riemannion vs. Adam during Llama 3-8B fine-tuning, as a function of LoRA rank and batch size.
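
The relative time cost used in the Figure 4 caption is a simple ratio. A minimal sketch of that formula follows; the function name and the timing values are made up for illustration and are not measurements from the paper.

```python
def relative_time_cost(t_riemannion, t_adam):
    """Return (T_Riemannion - T_Adam) / T_Adam for two step times in the same unit."""
    if t_adam <= 0:
        raise ValueError("t_adam must be positive")
    return (t_riemannion - t_adam) / t_adam


# Hypothetical per-step times in seconds, chosen only to exercise the formula.
example = relative_time_cost(1.15, 1.00)
print(f"relative overhead: {example:.2%}")
```

A value of 0 means the two optimizers take the same time per step, and positive values give the fractional slowdown relative to Adam.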

Theorems & Definitions (10)

  • Theorem 5.1
  • proof
  • proof
  • Proposition D.1
  • proof
  • Proposition D.2
  • proof
  • Lemma D.1
  • proof
  • Remark D.1