Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models
Feihu Jin, Yin Liu, Ying Tan
TL;DR
The paper tackles the high computational cost of gradient-based fine-tuning in large language models by introducing a derivative-free optimization (DFO) framework for low-rank adapters inserted into every self-attention layer. It optimizes the per-layer low-rank factors with two DFO methods, CMA-ES and the Fireworks Algorithm, in a layer-wise divide-and-conquer fashion. A linear mapping projects each optimized subspace onto the low-rank modules, which update the query and key projections via $W_Q \leftarrow W_Q + B_Q A_Q$ and $W_K \leftarrow W_K + B_K A_K$. Initialization of the projection modules is crucial: initializing them from the hidden-state distribution yields better performance than random initialization. Empirically, the approach yields memory savings, faster convergence, and strong few-shot gains on RoBERTa-large and GPT2-family models, outperforming both gradient-based parameter-efficient tuning methods and prior gradient-free baselines across seven NLU tasks and multiple model sizes.
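The layer-wise scheme summarized above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes a toy stack of linear layers with one frozen weight matrix $W$ per layer (standing in for $W_Q$ or $W_K$), a fixed random projection `P` as the linear mapping from a low-dimensional vector `z` to the factors $B$ and $A$, and a simple elitist random search swapped in for CMA-ES and the Fireworks Algorithm. All names (`decode`, `optimize`, `make_toy`) and all sizes are hypothetical.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def decode(z, P, d, r):
    """Linearly map subspace vector z to factors B (d x r) and A (r x d)."""
    flat = matvec(P, z)                      # length 2 * d * r
    B = [flat[i * r:(i + 1) * r] for i in range(d)]
    a_flat = flat[d * r:]
    A = [a_flat[i * d:(i + 1) * d] for i in range(r)]
    return B, A

def forward(x, layers, zs, d, r):
    """Apply each adapted layer: x <- (W + B A) x, with W kept frozen."""
    for (W, P), z in zip(layers, zs):
        B, A = decode(z, P, d, r)
        base = matvec(W, x)
        low_rank = matvec(B, matvec(A, x))   # B (A x), never forms B A
        x = [b + l for b, l in zip(base, low_rank)]
    return x

def loss(layers, zs, data, d, r):
    """Black-box objective: mean squared error, forward passes only."""
    total = 0.0
    for x, y in data:
        out = forward(x, layers, zs, d, r)
        total += sum((o - t) ** 2 for o, t in zip(out, y))
    return total / len(data)

def optimize(layers, data, d, r, k, rounds=20, popsize=8, sigma=0.3, seed=0):
    """Alternate over layers; perturb one layer's z and keep improvements.

    Assumption: elitist random search stands in for CMA-ES / Fireworks.
    """
    rng = random.Random(seed)
    zs = [[0.0] * k for _ in layers]         # z = 0 gives B A = 0 (frozen model)
    best = loss(layers, zs, data, d, r)
    history = [best]
    for _ in range(rounds):
        for i in range(len(layers)):         # one layer at a time
            for _ in range(popsize):
                cand = [zi + sigma * rng.gauss(0, 1) for zi in zs[i]]
                trial = zs[:i] + [cand] + zs[i + 1:]
                val = loss(layers, trial, data, d, r)
                if val < best:               # gradient-free acceptance test
                    best, zs = val, trial
            history.append(best)
    return zs, history

def make_toy(d, r, k, n_layers=2, n_samples=8, seed=1):
    """Build random frozen weights, projections, and regression data."""
    rng = random.Random(seed)
    layers = []
    for _ in range(n_layers):
        W = [[rng.gauss(0, 0.5) for _ in range(d)] for _ in range(d)]
        P = [[rng.gauss(0, 0.5) for _ in range(k)] for _ in range(2 * d * r)]
        layers.append((W, P))
    data = [([rng.gauss(0, 1) for _ in range(d)],
             [rng.gauss(0, 1) for _ in range(d)]) for _ in range(n_samples)]
    return layers, data

D, R, K = 4, 1, 3                            # hidden size, rank, subspace size
layers, data = make_toy(D, R, K)
zs, history = optimize(layers, data, D, R, K)
print(f"loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

Only forward passes are evaluated, so no gradients or activations for back-propagation need to be stored, and each search step touches just the `K` subspace coordinates of a single layer. Because a candidate is accepted only when it lowers the loss, the recorded loss in this sketch never increases; a real CMA-ES or Fireworks run would additionally adapt its sampling distribution.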
Abstract
Parameter-efficient tuning methods such as LoRA can achieve performance comparable to full model tuning while tuning only a small portion of the parameters. However, substantial computational resources are still required, because the process involves computing gradients and back-propagating through the entire model. Much recent effort has been devoted to derivative-free optimization methods, which avoid computing gradients and show greater robustness in few-shot settings. In this paper, we insert low-rank modules into each self-attention layer of the model and employ two derivative-free optimization methods to optimize these low-rank modules layer by layer, in alternation. Extensive results on various tasks and language models demonstrate that our proposed method achieves substantial improvements and exhibits clear advantages in memory usage and convergence speed over existing gradient-based parameter-efficient tuning and derivative-free optimization methods in few-shot settings.
