AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning

Yifan Yang; Kai Zhen; Ershad Banijamal; Athanasios Mouchtaris; Zheng Zhang

TL;DR

AdaZeta tackles the memory bottleneck in fine-tuning large language models by marrying ultra-low-parameter tensor-train adapters with a zeroth-order optimization framework. It introduces fast-forward tensor-train adapters and an adaptive, sublinear query schedule to stabilize and accelerate ZO fine-tuning, reducing memory without sacrificing accuracy. The authors provide a theoretical convergence bound that shows benefits from reducing trainable parameter count and from increasing the per-step query budget sublinearly, and validate the approach on Roberta-Large and Llama-2-7B across a range of tasks, achieving memory reductions of around 8×. Overall, AdaZeta enables memory-efficient, scalable fine-tuning of large models with competitive or superior performance compared to first-order PEFT baselines, offering practical pathways for low-resource or memory-constrained deployment.
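To make the "ultra-low-parameter" claim concrete, the following sketch counts the parameters of a tensor-train (TT) factorization against a dense matrix of the same shape. The 768×64 projection, the mode factorization [8, 8, 12] × [8, 8], and the TT rank of 5 are illustrative assumptions, not values taken from the paper, and the helper names `tt_param_count` and `tt_to_matrix` are hypothetical.

```python
import math
import numpy as np

def tt_param_count(modes, rank):
    """Parameters in a TT chain of 3-way cores with shape (r_prev, mode, r_next)."""
    ranks = [1] + [rank] * (len(modes) - 1) + [1]  # boundary ranks are 1
    return sum(ranks[i] * k * ranks[i + 1] for i, k in enumerate(modes))

def tt_to_matrix(cores, rows, cols):
    """Contract the TT cores left to right and reshape into a rows x cols matrix."""
    full = cores[0]
    for core in cores[1:]:
        # Contract the trailing rank index with the next core's leading rank index.
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.reshape(rows, cols)

# Assumed example: a 768 x 64 adapter projection, 768 = 8*8*12 and 64 = 8*8.
in_modes, out_modes, rank = [8, 8, 12], [8, 8], 5
modes = in_modes + out_modes
ranks = [1] + [rank] * (len(modes) - 1) + [1]

rng = np.random.default_rng(0)
cores = [rng.standard_normal((ranks[i], k, ranks[i + 1]))
         for i, k in enumerate(modes)]

W = tt_to_matrix(cores, math.prod(in_modes), math.prod(out_modes))
tt_params = tt_param_count(modes, rank)
dense_params = W.size

print(tt_params, dense_params)  # 780 49152
```

Under these assumed shapes, the trainable parameter count drops from 49,152 to 780. Because the accuracy of a zeroth-order gradient estimate degrades as the number of perturbed parameters grows, shrinking the trainable dimension in this way is what makes the pairing with ZO optimization attractive.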

Abstract

Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
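As a rough illustration of forward-pass-only training, the sketch below implements a two-point randomized gradient estimate averaged over several queries, together with a query count that grows sublinearly with the epoch. The square-root schedule, its constants `alpha` and `q_max`, the toy quadratic loss, and all function names are assumptions made for this example; the paper's actual schedule and hyperparameters are not reproduced here.

```python
import math
import numpy as np

def zo_gradient(loss_fn, w, eps=1e-3, num_queries=1, rng=None):
    """Average of two-point zeroth-order gradient estimates (forward passes only)."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(w)
    for _ in range(num_queries):
        z = rng.standard_normal(w.shape)  # random Gaussian perturbation direction
        delta = loss_fn(w + eps * z) - loss_fn(w - eps * z)
        grad += (delta / (2.0 * eps)) * z
    return grad / num_queries

def query_schedule(epoch, alpha=1.0, q_max=16):
    """Hypothetical sublinear schedule: queries grow like sqrt(epoch), capped at q_max."""
    return min(max(1, math.ceil(alpha * math.sqrt(epoch + 1))), q_max)

def zo_sgd(loss_fn, w0, lr=0.05, epochs=20, steps_per_epoch=25, seed=0):
    """Plain ZO-SGD loop that raises the per-step query count as training proceeds."""
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    for epoch in range(epochs):
        q = query_schedule(epoch)
        for _ in range(steps_per_epoch):
            w -= lr * zo_gradient(loss_fn, w, num_queries=q, rng=rng)
    return w

# Toy problem standing in for a fine-tuning loss: minimize ||w - 1||^2 in 5 dimensions.
target = np.ones(5)
toy_loss = lambda w: float(np.sum((w - target) ** 2))

w_init = np.zeros(5)
w_final = zo_sgd(toy_loss, w_init)
print(toy_loss(w_init), toy_loss(w_final))
```

Each query costs two forward passes and no backpropagation graph is stored, which is where the memory saving comes from. Averaging more queries per step lowers the variance of the estimate at the price of extra forward passes, so starting with few queries and adding more only as training progresses is one way to trade cost against stability.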

Paper Structure (21 sections, 2 theorems, 25 equations, 3 figures, 10 tables, 1 algorithm)

Introduction
Background
Parameter-Efficient Fine-tuning
Tensorized Adapters
Methods
Zeroth-order Estimation
The AdaZeta Framework
Theoretical Analysis
Experiments
Medium-size Roberta-Large Models
Large-scale Llama-2 Models
Memory and Training Time Efficiency
Further Comparison with LoRA
Conclusion
Detail of Experiment Setup
...and 6 more sections

Key Result

Theorem 1

Under A1 and A2, if $\bm{w}_T$ is picked at random from the iterate history with probability $P(T=k)=\frac{1}{K}$, the convergence of the AdaZeta algorithm can be bounded by: where $R$ is the gap $\ell(\bm{w}_1) - \ell^*$ between the loss at the starting point and the optimal loss, $\epsilon$ is the ZO perturbation scaling factor, and $C(d,\epsilon)$ is a constant related to the model parameter si

Figures (3)

Figure 1: The evaluation loss curves for the SST-2, WiC, and CB tasks using the Llama-2-7B model. The proposed AdaZeta method converges faster and effectively addresses the divergence problem using a much smaller batch size (BS). Both MeZO-LoRA and AdaZeta use a learning rate of 1e-4, while Sparse-MeZO utilizes a 1e-6 learning rate.
Figure 2: Illustration for tensorized linear layer and tensorized adapters.
Figure 3: Trade-off between the accuracy and memory cost for different fine-tuning methods. We can observe that the AdaZeta method achieves the best accuracy among the memory-efficient methods.

Theorems & Definitions (4)

Theorem 1
proof
Lemma 1
proof
