Zero-Order Optimization for LLM Fine-Tuning via Learnable Direction Sampling

Valery Parfenov; Grigoriy Evseev; Andrey Veprikov; Nikolay Bushkov; Stanislav Moiseev; Aleksandr Beznosikov

TL;DR

Fine-tuning large language models is hindered by the high memory cost of backpropagation. The paper introduces Learnable Direction Sampling Descent (LDSD), which treats the mean of the perturbation distribution as a learnable policy so that zero-order directional derivatives align with the true gradient. Theoretical results show that gradient alignment increases over iterations and that convergence bounds can be made dimension-free, while the approach remains a plug-in for existing ZO optimizers. Empirical results on SST-2 with RoBERTa-Large and OPT-1.3B show consistent improvements over standard ZO baselines, supporting the practical viability of adaptive direction sampling for scalable zero-order fine-tuning.
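To make the alignment idea concrete, here is a minimal Python sketch using only the standard library. It is not the paper's LDSD algorithm: the function name `mean_sq_cosine`, the unit-covariance Gaussian, and the hand-picked mean `mu` are assumptions for illustration, and the sketch reads the true gradient directly, which a real zero-order method cannot do. It only shows that shifting the mean of the sampling distribution toward the gradient raises the average squared cosine between sampled directions and the gradient.

```python
import math
import random


def mean_sq_cosine(grad, mu, n_samples=2000, seed=0):
    """Monte Carlo estimate of E[cos^2(u, grad)] for u ~ N(mu, I).

    Hypothetical helper for illustration; not from the paper's code.
    """
    rng = random.Random(seed)
    g_norm = math.sqrt(sum(g * g for g in grad))
    total = 0.0
    for _ in range(n_samples):
        # Sample a direction from a Gaussian with mean mu and identity covariance.
        u = [m + rng.gauss(0.0, 1.0) for m in mu]
        u_norm = math.sqrt(sum(ui * ui for ui in u))
        dot = sum(gi * ui for gi, ui in zip(grad, u))
        total += (dot / (g_norm * u_norm)) ** 2
    return total / n_samples


if __name__ == "__main__":
    d = 50
    grad = [1.0] + [0.0] * (d - 1)      # toy gradient along the first axis
    zero_mean = [0.0] * d               # isotropic sampling
    shifted_mean = [5.0] + [0.0] * (d - 1)  # mean moved toward the gradient (assumed)
    print("isotropic:", mean_sq_cosine(grad, zero_mean))
    print("shifted:  ", mean_sq_cosine(grad, shifted_mean))
```

With a zero mean the average squared cosine is close to 1/d, which is why isotropic sampling carries little gradient information in high dimension; moving the mean toward the gradient increases it.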

Abstract

Fine-tuning large pretrained language models (LLMs) is a cornerstone of modern NLP, yet its growing memory demands (driven by backpropagation and large optimizer states) limit deployment in resource-constrained settings. Zero-order (ZO) methods bypass backpropagation by estimating directional derivatives from forward evaluations, offering substantial memory savings. However, classical ZO estimators suffer from high variance and an adverse dependence on the parameter dimensionality $d$, which has constrained their use to low-dimensional problems. In this work, we propose a policy-driven ZO framework that treats the sampling distribution over perturbation directions as a learnable policy and updates it to reduce the variance of directional estimates. We develop a practical algorithm implementing this idea and provide a theoretical analysis, showing that learned sampling distributions improve the quality of gradient information and relax the explicit dependence on $d$ in convergence bounds. Empirically, we validate the approach on challenging LLM fine-tuning benchmarks, demonstrating substantially improved performance compared to standard ZO baselines. Our results suggest that adaptive direction sampling is a promising route to make ZO fine-tuning viable at scale. The source code is available at https://github.com/brain-lab-research/zo_ldsd
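As a point of reference for the classical estimators the abstract mentions, the following is a minimal sketch of a two-point zero-order step with isotropic Gaussian directions, written in plain Python. The names `zo_directional_grad`, `zo_sgd`, and `quadratic`, the step size, and the toy objective are assumptions for illustration; none of them are taken from the paper or its repository.

```python
import random


def zo_directional_grad(f, x, rng, eps=1e-4):
    """Two-point estimate: ((f(x + eps*u) - f(x - eps*u)) / (2*eps)) * u."""
    # Isotropic Gaussian direction (the classical, non-learned choice).
    u = [rng.gauss(0.0, 1.0) for _ in x]
    x_plus = [xi + eps * ui for xi, ui in zip(x, u)]
    x_minus = [xi - eps * ui for xi, ui in zip(x, u)]
    # Finite-difference approximation of the directional derivative along u.
    slope = (f(x_plus) - f(x_minus)) / (2.0 * eps)
    return [slope * ui for ui in u]


def quadratic(x):
    """Toy objective 0.5 * ||x||^2 (assumed for illustration)."""
    return 0.5 * sum(xi * xi for xi in x)


def zo_sgd(f, x0, lr, steps, seed=0):
    """Plain ZO-SGD loop: only forward evaluations of f, no backpropagation."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_directional_grad(f, x, rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    x0 = [1.0] * 10
    x = zo_sgd(quadratic, x0, lr=0.05, steps=500)
    print("initial loss:", quadratic(x0))
    print("final loss:  ", quadratic(x))
```

Each step uses only two forward evaluations of `f` and stores no gradients. On this quadratic, in exact arithmetic, the expected squared norm contracts per step by a factor of 1 - 2*lr + lr^2*(d + 2), so the usable step size shrinks roughly like 1/d, a small instance of the dimension dependence described above.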

Paper Structure (26 sections, 8 theorems, 80 equations, 3 figures, 2 tables, 2 algorithms)

Introduction
Related Work
Memory efficient approaches
ZO optimization with adaptive direction sampling
Algorithm and Theoretical Analysis under Directional Setup
Notation and Assumptions
Motivation
Theoretical Algorithm
The dynamics of gradient alignment
Intuition.
Convergence Guarantees
Discussion.
Toy Experiment
Zero-order Framework
Experiments
...and 11 more sections

Key Result

Lemma 2

Under Assumptions ass:lip and ass:minimizer, for the iterations of DGD (eq:cgd_iteration), the following inequality holds: where $\mathcal{F}^t$ is the $\sigma$-algebra generated by $\{x^0, x^1, \dots, x^t\}$.

Figures (3)

Figure 1: Landscape of the function $\mathbb{E}\left[ C^{t} \mid \mathcal{F}^{t-1}\right]$ with respect to $\mu^t$, for $\nabla f(x^t) = (1,0)^\top$ and $d=2$.
Figure 2: Comparison of LDSD and the baseline on the a9a regression task.
Figure 3: Test accuracy of ZO-SGD (Algorithm \ref{alg:zo_framework} sampling) on SST-2 RoBERTa-large with LoRA for different hyperparameters.

Theorems & Definitions (16)

Lemma 2
Corollary 2
Theorem 2
Lemma 3
Lemma 4
proof
proof
Lemma 5
proof
proof: Proof of Theorem \ref{th:main}
...and 6 more
