Powering Up Zeroth-Order Training via Subspace Gradient Orthogonalization

Yicheng Lang; Changsheng Wang; Yihua Zhang; Mingyi Hong; Zheng Zhang; Wotao Yin; Sijia Liu

TL;DR

This work shows that ZO optimization can be substantially improved by unifying two complementary principles: a projection-based subspace view that reduces gradient estimation variance by exploiting the intrinsic low-rank structure of model updates, and Muon-style spectral optimization that applies gradient orthogonalization to extract informative spectral structure from noisy ZO gradients.

Abstract

Zeroth-order (ZO) optimization provides a gradient-free alternative to first-order (FO) methods by estimating gradients via finite differences of function evaluations, and has recently emerged as a memory-efficient paradigm for fine-tuning large-scale models by avoiding backpropagation. However, ZO optimization has a fundamental tension between accuracy and query efficiency. In this work, we show that ZO optimization can be substantially improved by unifying two complementary principles: (i) a projection-based subspace view that reduces gradient estimation variance by exploiting the intrinsic low-rank structure of model updates, and (ii) Muon-style spectral optimization that applies gradient orthogonalization to extract informative spectral structure from noisy ZO gradients. These findings form a unified framework of subspace gradient orthogonalization, which we instantiate in a new method, ZO-Muon, admitting a natural interpretation as a low-rank Muon optimizer in the ZO setting. Extensive experiments on large language models (LLMs) and vision transformers (ViTs) demonstrate that ZO-Muon significantly accelerates convergence and achieves a win-win improvement in accuracy and query/runtime efficiency. Notably, compared to the popular MeZO baseline, ZO-Muon requires only 24.7% of the queries to reach the same SST-2 performance for LLM fine-tuning, and improves accuracy by 25.1% on ViT-B fine-tuning on CIFAR-100.
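
The finite-difference gradient estimation mentioned in the abstract can be illustrated with a minimal sketch. This is a generic two-point randomized estimator, not the paper's ZO-Muon algorithm; the function name `zo_two_point_grad`, the smoothing parameter `mu`, the argument `num_directions`, and the quadratic test objective are all assumptions chosen for illustration.

```python
import numpy as np

def zo_two_point_grad(f, x, mu=1e-3, num_directions=1, rng=None):
    """Estimate the gradient of f at x using only function evaluations.

    Hypothetical helper: each random direction costs two queries of f.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x, dtype=float)
    for _ in range(num_directions):
        u = rng.standard_normal(x.shape)  # random Gaussian perturbation
        # Central finite difference along u approximates the directional derivative.
        delta = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
        grad += delta * u
    return grad / num_directions

def quadratic(x):
    # Toy objective (an assumption); its true gradient at x is x itself.
    return 0.5 * float(np.sum(x ** 2))

rng = np.random.default_rng(0)
x0 = np.array([1.0, -2.0, 3.0])
estimate = zo_two_point_grad(quadratic, x0, num_directions=20000, rng=rng)
print("estimate:", estimate)
print("true gradient:", x0)
```

Averaging over many directions shrinks the estimator's variance but multiplies the number of function queries, which is the accuracy versus query-efficiency tension the abstract refers to.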

Paper Structure (25 sections, 1 theorem, 15 equations, 14 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Rethinking ZO Optimization: From Full Space to A Projected Subspace View
ZO-Muon: Boosting Subspace ZO via Gradient Orthogonalization
Experiments
Experiment setups
Experiment results
Conclusion
Intrinsic Rank Measurement
Ineffectiveness of Full Space Gradient Orthogonalization
ZO-Muon Algorithm
Proof of Proposition 1
Hyperparameter Study on $N_q$ in ZO-Muon
Detailed Experiment Setups
Models and datasets.
...and 10 more sections

Key Result

Proposition 1

If the projection is chosen as $\mathbf{P} = \mathbf{U}_{[:, :k]} \in \mathbb{R}^{m \times k}$, obtained from the SVD of $\mathbf{G}$ in (eq:matrix_sign), then the projection for gradient orthogonalization is lossless. That is,
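
As a hedged numerical illustration of what "lossless" could mean here, the sketch below makes several assumptions that the text above does not confirm: that gradient orthogonalization maps $\mathbf{G} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$ to $\mathbf{U}\mathbf{V}^\top$ over its nonzero singular values, that $\mathbf{G}$ has rank exactly $k$, and that the lossless identity reads $\mathbf{P}\,\mathrm{msign}(\mathbf{P}^\top \mathbf{G}) = \mathrm{msign}(\mathbf{G})$. The helper `msign` and all sizes are hypothetical.

```python
import numpy as np

def msign(G, tol=1e-10):
    """Orthogonalize G via its SVD: U @ V^T over the nonzero singular values.

    Assumed definition of gradient orthogonalization, for illustration only.
    """
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))  # numerical rank
    return U[:, :r] @ Vt[:r, :]

rng = np.random.default_rng(0)
m, n, k = 8, 6, 3

# Build a matrix of rank exactly k (assumption: the gradient is low-rank).
G = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))

# Projection from the top-k left singular vectors of G, as in Proposition 1.
U, _, _ = np.linalg.svd(G, full_matrices=False)
P = U[:, :k]

full = msign(G)                    # orthogonalize in the full space
projected = P @ msign(P.T @ G)     # orthogonalize in the k-dim subspace, then lift
gap = np.linalg.norm(full - projected)

# For contrast: a random orthonormal projection is generally not lossless.
Q, _ = np.linalg.qr(rng.standard_normal((m, k)))
random_gap = np.linalg.norm(full - Q @ msign(Q.T @ G))

print("gap with P = U[:, :k]:", gap)
print("gap with random projection:", random_gap)
```

Under these assumptions the first gap sits at floating-point noise, while the random projection leaves a visible error, which is consistent with the role the proposition assigns to the SVD-based choice of $\mathbf{P}$.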

Figures (14)

Figure 1: ZO-Muon converges faster and reaches better fine-tuning accuracy than the SOTA ZO baselines MeZO [malladi2023finetuning] and LOZO [chen2025enhancing] when measured against runtime. (a) OPT-1.3B fine-tuned on the RTE task: training loss (left) and test accuracy (right) versus runtime, with the cumulative query count indicated by the color bar in the right subplot. (b) ViT-B fine-tuned on CIFAR-10 (left) and CIFAR-100 (right), shown in the same format as in (a, right).
Figure 2: ZO-Muon requires lower runtime and fewer function queries than all baselines (under comparable GPU memory usage) to reach the target fine-tuning accuracies: (a) $0.90$ on SST-2 and (b) $0.93$ on CIFAR-10. Baselines include MeZO [malladi2023finetuning], SparseMeZO (S-MeZO) [liu2025sparse], HiZOO [zhao2025secondorder], and LOZO [chen2025enhancing].
Figure 3: The necessity and advantages of Subspace RGE, illustrated by fine-tuning OPT-1.3B on SST-2: (a) the low-rank structure of model components during FO training with Adam across steps; (b) performance comparison of Subspace RGE–based MeZO (i.e., Subspace-MeZO) with the SOTA baselines SparseMeZO (i.e., S-MeZO) and MeZO; and (c) accuracy vs. runtime.
Figure 4: Training (a) and testing (b) performance comparison of the proposed Subspace RGE–based MeZO (i.e., Subspace-MeZO) with existing low-rank perturbation–based ZO baselines, LOZO [chen2025enhancing] and SubZero [yu2025zeroth], for fine-tuning OPT-13B on SQuAD, with performance measured by F1 (%).
Figure 5: Demonstration of the ineffectiveness of ZO-Muon-V0 vs. MeZO and the effectiveness of ZO-Muon through fine-tuning OPT-13B on SST-2.
...and 9 more figures

Theorems & Definitions (1)

Proposition 1
