Elucidating Subspace Perturbation in Zeroth-Order Optimization: Theory and Practice at Scale
Sihwan Park, Jihun Yun, SungYub Kim, Souvik Kundu, Eunho Yang
TL;DR
This work addresses the slow convergence of zeroth-order optimization in large-scale settings by introducing a unified framework for subspace perturbations. The framework is characterized by the subspace alignment $\rho_t$ and the local intrinsic dimension, and it explains variance reduction and generalization under finite perturbations. It shows that a broad class of perturbations yields similar convergence rates, so practical design choices can prioritize efficiency. Building on these insights, the authors propose MeZO-BCD, a block-coordinate zeroth-order method that perturbs and updates only a subset of parameters per step, achieving up to $2.77\times$ wall-clock speedup over MeZO on OPT-13B while maintaining comparable iteration complexity and fine-tuning performance. They validate the approach with extensive LLM fine-tuning experiments and theoretical guarantees, and discuss extensions such as adaptive block selection and stateful optimizers, positioning MeZO-BCD as a scalable foundation for dimension-efficient zeroth-order optimization.
Abstract
Zeroth-order (ZO) optimization has emerged as a promising alternative to gradient-based backpropagation methods, particularly for black-box optimization and large language model (LLM) fine-tuning. However, ZO methods often suffer from slow convergence due to high-variance stochastic gradient estimators. While subspace perturbations, such as sparsity and low-rank constraints, have been explored to mitigate this issue, their effectiveness remains poorly understood. In this work, we develop a \emph{unified theoretical framework} that analyzes both the convergence and generalization properties of ZO optimization under subspace perturbations. We show that high dimensionality is the primary bottleneck and introduce the notion of \textit{subspace alignment} to explain how subspace perturbations reduce gradient noise and accelerate convergence. Our analysis further shows that a broad class of subspace perturbations exhibits a similar convergence rate, motivating us to prioritize practical considerations in real-world algorithm design. Building on these insights, we propose an efficient ZO method using block coordinate descent (MeZO-BCD), which perturbs and updates only a subset of parameters at each step. Extensive experiments show that MeZO-BCD significantly accelerates optimization, achieving up to a $\mathbf{2.77\times}$ speedup in wall-clock time over MeZO on OPT-13B, while maintaining comparable iteration complexity and fine-tuning performance.
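The idea of perturbing and updating only a subset of parameters per step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the two-point Gaussian estimator, the cyclic block schedule, the toy quadratic objective, and all names (`zo_bcd_step`, `run_zo_bcd`, `quadratic`) are assumptions made for illustration only.

```python
import numpy as np

def zo_bcd_step(loss, params, block, lr=1e-2, eps=1e-3, rng=None):
    """One two-point zeroth-order step restricted to params[block].

    `block` is an integer index array; all other coordinates stay fixed.
    (Hypothetical helper, not taken from the paper.)
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample noise only for the active block, not the full parameter vector.
    z = rng.standard_normal(block.size)
    plus = params.copy()
    plus[block] += eps * z
    minus = params.copy()
    minus[block] -= eps * z
    # Scalar directional-derivative estimate from two loss evaluations.
    g = (loss(plus) - loss(minus)) / (2.0 * eps)
    new_params = params.copy()
    new_params[block] -= lr * g * z
    return new_params

def run_zo_bcd(loss, params, num_blocks, steps, lr=1e-2, eps=1e-3, seed=0):
    """Cycle through equal-sized blocks (assumed schedule), one block per step."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(np.arange(params.size), num_blocks)
    for t in range(steps):
        params = zo_bcd_step(loss, params, blocks[t % num_blocks], lr, eps, rng)
    return params

def quadratic(x):
    """Toy objective standing in for a fine-tuning loss."""
    return 0.5 * float(np.dot(x, x))

x0 = np.ones(16)
x_final = run_zo_bcd(quadratic, x0, num_blocks=4, steps=2000, lr=0.05)
print(quadratic(x0), quadratic(x_final))
```

In this sketch each step still needs two loss evaluations, but noise is sampled and parameters are written only for the active block, which is the kind of per-step saving the abstract attributes to block-wise perturbation.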
