Derivative-Free Optimization for Low-Rank Adaptation in Large Language Models

Feihu Jin, Yin Liu, Ying Tan

TL;DR

The paper tackles the high computational cost of gradient-based fine-tuning in large language models by introducing a derivative-free optimization framework for low-rank adapters inserted into every self-attention layer. It optimizes per-layer low-rank factors using two DFO methods (CMA-ES and Fireworks Algorithm) in a layer-wise divide-and-conquer fashion, with a linear mapping from optimized subspaces to the low-rank modules and updates to $W_Q$ and $W_K$ via $W_Q \leftarrow W_Q + B_Q A_Q$ and $W_K \leftarrow W_K + B_K A_K$. Initialization of the projection modules is crucial; initializing with the hidden-state distribution yields better performance than random initialization. Empirically, the approach yields memory savings, faster convergence, and strong few-shot gains on RoBERTa-large and GPT2-family models, outperforming both gradient-based parameter-efficient tuning methods and prior gradient-free baselines across seven NLU tasks and multiple model sizes.
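The update rule $W_Q \leftarrow W_Q + B_Q A_Q$ and the linear mapping from a low-dimensional search space to the low-rank factors can be illustrated with a toy sketch. This is not the authors' implementation: a simple elitist evolution strategy stands in for CMA-ES and the Fireworks Algorithm, a squared-error objective stands in for the task loss computed from forward passes, and a random Gaussian projection `P` stands in for the hidden-state-based initialization. The names `apply_lowrank`, `project`, `elitist_es`, and `demo` are hypothetical.

```python
import numpy as np


def apply_lowrank(W, B, A):
    """Low-rank update from the summary: W <- W + B A."""
    return W + B @ A


def project(z, P, d, r):
    """Linearly map a subspace vector z to the factors B (d x r) and A (r x d)."""
    theta = P @ z                      # shape (2 * d * r,)
    B = theta[: d * r].reshape(d, r)
    A = theta[d * r:].reshape(r, d)
    return B, A


def elitist_es(loss, dim, iters=300, pop=16, sigma=0.3, seed=0):
    """Gradient-free search: sample around the best point, keep improvements only.

    Assumption: this is a simplified stand-in, not CMA-ES or the Fireworks Algorithm.
    """
    rng = np.random.default_rng(seed)
    best_z = np.zeros(dim)
    best_f = loss(best_z)
    for _ in range(iters):
        candidates = best_z + sigma * rng.standard_normal((pop, dim))
        scores = np.array([loss(c) for c in candidates])
        i = int(np.argmin(scores))
        if scores[i] < best_f:
            best_z, best_f = candidates[i], scores[i]
    return best_z, best_f


def demo(d=8, r=2, subspace_dim=10, seed=0):
    rng = np.random.default_rng(seed)
    W_Q = rng.standard_normal((d, d))  # frozen "pretrained" weight (toy)
    # Assumption: random Gaussian projection, fixed during the search.
    P = rng.standard_normal((2 * d * r, subspace_dim)) / np.sqrt(subspace_dim)
    # Toy target that differs from W_Q by a rank-r matrix.
    target = W_Q + rng.standard_normal((d, r)) @ rng.standard_normal((r, d))

    def loss(z):
        # Only forward evaluations of the loss are used; no gradients.
        B_Q, A_Q = project(z, P, d, r)
        return float(np.sum((apply_lowrank(W_Q, B_Q, A_Q) - target) ** 2))

    init_loss = loss(np.zeros(subspace_dim))
    _, final_loss = elitist_es(loss, subspace_dim, seed=seed)
    return init_loss, final_loss


if __name__ == "__main__":
    init_loss, final_loss = demo()
    print(f"initial loss: {init_loss:.3f}, final loss: {final_loss:.3f}")
```

The search only ever touches the `subspace_dim`-dimensional vector `z`, which is why the optimizer needs no back-propagation through the model. In the layer-wise setting described above, one such search would presumably be run per layer in turn, with the other layers held fixed.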

Abstract

Parameter-efficient tuning methods such as LoRA can achieve performance comparable to full model tuning by tuning only a small portion of the parameters. However, substantial computational resources are still required, as this process involves calculating gradients and performing back-propagation throughout the model. Much recent effort has been devoted to derivative-free optimization methods, which eschew gradient computation and show greater robustness in few-shot settings. In this paper, we insert low-rank modules into each self-attention layer of the model and employ two derivative-free optimization methods to optimize these low-rank modules at each layer alternately. Extensive results on various tasks and language models demonstrate that our proposed method achieves substantial improvement and exhibits clear advantages in memory usage and convergence speed compared to existing gradient-based parameter-efficient tuning and derivative-free optimization methods in few-shot settings.

Paper Structure

This paper contains 24 sections, 6 equations, 4 figures, 6 tables, and 1 algorithm.

Figures (4)

  • Figure 1: The results of our proposed methods F-LoRA and C-LoRA compared to gradient-based and gradient-free methods on average performance over seven language understanding tasks. We evaluate all the methods on RoBERTa-large.
  • Figure 2: An illustration of derivative-free optimization for low-rank adaptation. We apply the low-rank matrices (green boxes) at the self-attention module of each layer and initialize them with model-specific normal distributions. We use two derivative-free methods (CMA-ES and the Fireworks Algorithm) to alternately optimize the low-rank modules at the self-attention module of each layer.
  • Figure 3: The results of different dimensions on the SST2 and SNLI datasets with the GPT2-XL model.
  • Figure 4: The results of different low-rank dimensions $r$ on the SST2 and Yelpp datasets with the GPT2-XL model.