Elucidating Subspace Perturbation in Zeroth-Order Optimization: Theory and Practice at Scale

Sihwan Park, Jihun Yun, SungYub Kim, Souvik Kundu, Eunho Yang

TL;DR

This work addresses the slow convergence of zeroth-order optimization in large-scale settings by introducing a unified framework for subspace perturbations. The framework is characterized by the subspace alignment $\rho_t$ and the local intrinsic dimension, and it explains variance reduction and generalization under finite perturbations. It shows that a broad class of perturbations yields similar convergence rates, so practical design choices can prioritize efficiency. Building on these insights, the authors propose MeZO-BCD, a block-coordinate zeroth-order method that perturbs and updates only a subset of parameters per step, achieving up to $2.77\times$ wall-clock speedup over MeZO on OPT-13B while maintaining comparable iteration complexity and fine-tuning performance. They validate the approach with extensive LLM fine-tuning experiments and theoretical guarantees, and discuss extensions such as adaptive block selection and stateful optimizers, highlighting MeZO-BCD as a scalable foundation for dimension-efficient zeroth-order optimization.

Abstract

Zeroth-order (ZO) optimization has emerged as a promising alternative to gradient-based backpropagation methods, particularly for black-box optimization and large language model (LLM) fine-tuning. However, ZO methods often suffer from slow convergence due to high-variance stochastic gradient estimators. While subspace perturbations, such as sparsity and low-rank constraints, have been explored to mitigate this issue, their effectiveness remains poorly understood. In this work, we develop a unified theoretical framework that analyzes both the convergence and generalization properties of ZO optimization under subspace perturbations. We show that high dimensionality is the primary bottleneck and introduce the notion of subspace alignment to explain how the subspace perturbations reduce gradient noise and accelerate convergence. Our analysis further shows that a broad class of subspace perturbations exhibits a similar convergence rate, motivating us to prioritize practical considerations in real-world algorithm design. Building on these insights, we propose an efficient ZO method using block coordinate descent (MeZO-BCD), which perturbs and updates only a subset of parameters at each step. Extensive experiments show that MeZO-BCD significantly accelerates optimization, achieving up to $2.77\times$ speedup in wall-clock time over MeZO on OPT-13B, while maintaining comparable iteration complexity and fine-tuning performance.
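
The idea of perturbing and updating only a subset of parameters per step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy quadratic loss, the cyclic block schedule, the Gaussian perturbation, and the names `zo_bcd_step` and `run_zo_bcd` are all assumptions made for illustration, and the memory-saving tricks of MeZO are omitted.

```python
import numpy as np

def zo_bcd_step(loss, theta, block, lr, mu, rng):
    """One two-point zeroth-order step restricted to the coordinates in `block`."""
    z = np.zeros_like(theta)
    z[block] = rng.standard_normal(len(block))  # perturb only the active block
    # Two-point finite-difference estimate of the directional derivative along z.
    g = (loss(theta + mu * z) - loss(theta - mu * z)) / (2.0 * mu)
    new_theta = theta.copy()
    new_theta[block] -= lr * g * z[block]  # update only the active block
    return new_theta

def run_zo_bcd(loss, theta0, num_blocks, steps, lr=1e-2, mu=1e-3, seed=0):
    """Run block-coordinate ZO-SGD with a cyclic block schedule (an assumption)."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(np.arange(theta0.size), num_blocks)
    theta = theta0.copy()
    for t in range(steps):
        block = blocks[t % num_blocks]
        theta = zo_bcd_step(loss, theta, block, lr, mu, rng)
    return theta

def quadratic(theta):
    """Toy objective used only for this illustration."""
    return 0.5 * float(theta @ theta)

theta0 = np.ones(16)
theta_final = run_zo_bcd(quadratic, theta0, num_blocks=4, steps=2000)
print(quadratic(theta0), quadratic(theta_final))
```

Each step costs two loss evaluations regardless of block size, while the random direction and the update touch only the active block's coordinates.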

Paper Structure

This paper contains 67 sections, 19 theorems, 115 equations, 10 figures, 11 tables, 5 algorithms.

Key Result

Theorem 3.3

Suppose that con:conv_smooth, con:conv_gradient, and Assumption assumption:intdim hold with $\mathrm{srank}(\mathbf{M}_t) \le s$. Under the following parameter settings, $\mu = \mathcal{O}\big(\frac{1}{L\sqrt{\cdots}}\big)$ [...], where $\Delta \coloneqq \mathcal{L}_\mu(\bm\theta_1) - \mathcal{L}_\mu(\bm\theta_T)$.

Figures (10)

  • Figure 1: Empirical validation of theoretical insights on randomized quadratic minimization. Left: Higher $\bar{\rho}$ values yield faster convergence, as predicted ($s$ is fixed). Middle: Iterations required to reach a fixed loss target decrease proportionally with $1/\bar{\rho}$. Right: Distribution of $\rho$ under three $\mathbf{M}$ types shows matching means but different concentration patterns. Horizontal blue dashed lines indicate the $\bar{\rho}$ for each method. Supplementary results are provided in the appendix (app:supp_empirical_study).
  • Figure 2: Comparison of training loss curves in terms of iterations (left) and wall-clock time (middle) for different methods when fine-tuning OPT-1.3B (zhang2022opt) on SST-2 (sst2).
  • Figure 3: Comparative visualization of the distribution of subspace alignment $\rho$ under varying $\mathrm{srank}(\mathbf{M})$ across three perturbation types: low-rank projection, sparse perturbation, and block sparse perturbation. (a) The log-scaled boxen plot visualizes the lower tail behavior of $\rho$. (b) The strip plot highlights the discrete nature of block sparse perturbations.
  • Figure 4: Training loss curves for varying $\mathrm{srank}(\mathbf{M})$ across low-rank, sparse, and block sparse perturbation methods. Each plot corresponds to a fixed $\mathrm{srank}(\mathbf{M})$. The reported $\bar{\rho}$ values indicate the mean subspace alignment for each method. Learning rate $\eta = 10^{-4}$ is used throughout.
  • Figure 5: Training loss curves averaged over three seeds {$0$, $42$, $100$} for OPT-13B on SST-2 and SQuAD. The shaded region indicates the standard deviation.
  • ...and 5 more figures
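
The captions above compare low-rank, sparse, and block sparse perturbations at a matched $\mathrm{srank}(\mathbf{M})$. The following NumPy sketch shows how three such matrices can share the same value. It assumes that $\mathrm{srank}$ denotes the stable rank $\|\mathbf{M}\|_F^2 / \|\mathbf{M}\|_2^2$ and uses generic constructions (a random orthogonal projector and 0/1 diagonal masks); the paper's Definition 3.5 is not reproduced in this summary, so the constructions and all function names are hypothetical.

```python
import numpy as np

def stable_rank(M):
    """Stable rank ||M||_F^2 / ||M||_2^2 (assumed meaning of srank)."""
    fro_sq = np.linalg.norm(M, "fro") ** 2
    spec_sq = np.linalg.norm(M, 2) ** 2  # squared largest singular value
    return fro_sq / spec_sq

def low_rank_projection(d, s, rng):
    """Orthogonal projector onto a random s-dimensional subspace of R^d."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, s)))
    return Q @ Q.T

def sparse_mask(d, s, rng):
    """Diagonal 0/1 mask selecting s coordinates uniformly at random."""
    m = np.zeros(d)
    m[rng.choice(d, size=s, replace=False)] = 1.0
    return np.diag(m)

def block_sparse_mask(d, s, rng):
    """Diagonal 0/1 mask selecting one contiguous block of size s (s must divide d)."""
    b = rng.integers(d // s)
    m = np.zeros(d)
    m[b * s:(b + 1) * s] = 1.0
    return np.diag(m)

rng = np.random.default_rng(0)
d, s = 32, 8
mats = {
    "low-rank": low_rank_projection(d, s, rng),
    "sparse": sparse_mask(d, s, rng),
    "block sparse": block_sparse_mask(d, s, rng),
}
sranks = {name: stable_rank(M) for name, M in mats.items()}
print(sranks)
```

Under these assumptions all three matrices are projectors of rank $s$, so their stable rank equals $s$ up to floating-point error, even though they select very different subspaces.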

Theorems & Definitions (39)

  • Definition 3.2: Subspace Alignment
  • Theorem 3.3: Dimension-Free Rate of Subspace ZO-SGD
  • Proposition 3.4: Expected Subspace Alignment $\rho$
  • Definition 3.5: Three Instantiations of $\mathbf{M}$
  • Proposition 3.6: Upper Tail Probabilities of $\rho$
  • Definition 3.7: Generalization Error
  • Definition 3.8: Uniform Stability (bousquet2002stability, hardt2016train)
  • Lemma 3.9: Theorem 2.2 in hardt2016train
  • Theorem 3.10: Generalization of Subspace ZO-SGD
  • Lemma B.1: Alignment of Smoothed Gradients
  • ...and 29 more