Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs

Yifei Zhang, Hao Zhu, Aiwei Liu, Han Yu, Piotr Koniusz, Irwin King

Abstract

Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks. However, the enormous size of LLMs poses significant challenges in terms of computational complexity and resource requirements. Low-Rank Adaptation (LoRA) has emerged as a promising solution. However, there exists a gap between the practical performance of low-rank adaptations and their theoretical optimum. In this work, we propose eXtreme Gradient Boosting LoRA (XGBLoRA), a novel framework that bridges this gap by leveraging the power of ensemble learning. Inspired by gradient boosting, XGBLoRA iteratively learns and merges a sequence of LoRA adaptations to refine model predictions. It achieves better performance than standard LoRA, while enjoying the computational efficiency of rank-1 adaptations. We provide theoretical analysis to show the convergence and optimality of our approach, and conduct extensive experiments on a range of natural language processing tasks. The results demonstrate that XGBLoRA consistently outperforms standard LoRA and achieves performance comparable to full fine-tuning with significantly fewer trainable parameters. This work advances parameter-efficient fine-tuning for LLMs, and offers a promising solution for adapting LLMs to downstream tasks while optimizing performance and efficiency.
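To make the parameter saving of rank-1 adaptations concrete, the short sketch below counts the trainable parameters of a single LoRA adapter $BA$ attached to a $d_{out} \times d_{in}$ weight matrix, using the standard count $r(d_{out} + d_{in})$. The function names and the hidden size of 4096 are illustrative assumptions, not values taken from the paper.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters of one adapter B @ A, with B of shape
    (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)


def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


# Hypothetical square projection layer; 4096 is an assumed hidden size.
d = 4096
for rank in (1, 8, 16):
    n = lora_trainable_params(d, d, rank)
    share = n / full_finetune_params(d, d)
    print(f"rank={rank:>2}: {n:>7} trainable parameters ({share:.4%} of the full matrix)")
```

Under these assumptions, a rank-1 adapter trains 8192 parameters per matrix, and a rank-$r$ adapter trains exactly $r$ times as many.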

Paper Structure

This paper contains 16 sections, 10 theorems, 48 equations, 4 figures, 7 tables.

Key Result

Lemma 1

The XGBLoRA update approximates the full gradient update with an error bounded in terms of $r$, the LoRA rank, and $M$, the number of minibatches, with constants $C_1, C_2$ that depend on the properties of $\mathcal{L}$ and the gradient variance, respectively. (The complete proof is in the Appendix.)

Figures (4)

  • Figure 1: Efficiency vs. effectiveness on the GLUE dataset. Our XGBLoRA attains a high average score while using fewer parameters than competitors. Mini-figure: speed in seconds per batch.
  • Figure 2: We prove via the error bound in Th. 2 that, by compensating for low-rank $r=1$ updates with $T$ gradient boosting (GB) iterations, XGBLoRA's $\texttt{err} \leq C(1+\frac{1}{\sqrt{T}})$ is close to LoRA's $\texttt{err} \leq C(\frac{1}{r}+1)$. Mini-figure: XGBLoRA consumes only $\mathcal{O}(1)$ memory for updates, while LoRA consumes $\mathcal{O}(r)$.
  • Figure 3: The pipeline of XGBLoRA: A booster is constructed via randomly choosing $L_s=2$ adapter layers. Then, it is trained for $\kappa$ steps before merging with the base model. The next booster is then learnt.
  • Figure 4: Performance of XGBLoRA with varying $\kappa=\frac{K}{T}$ for LLaMA3-8B, Mistral-7B, and LLaMA2-13B base models. Performance of LoRA is marked with a red dashed line.
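The pipeline in the Figure 3 caption (pick $L_s=2$ layers at random, train a rank-1 booster for $\kappa$ steps, merge it into the base weights, repeat) can be sketched on a toy problem. The code below is a minimal illustration under loud assumptions, not the authors' implementation: each "layer" is a small matrix, the loss is a squared Frobenius distance to a synthetic target rather than a language-model loss, and all function names, sizes, and the learning rate are hypothetical.

```python
import math
import random


def layer_loss(W, target):
    """Toy objective (assumption): half the squared Frobenius distance."""
    return 0.5 * sum((w - t) ** 2 for rw, rt in zip(W, target) for w, t in zip(rw, rt))


def train_booster(W, target, kappa, lr, rng):
    """Train one rank-1 adapter b a^T on top of frozen W for kappa steps."""
    rows, cols = len(W), len(W[0])
    # LoRA-style init: one factor random, the other zero, so b a^T starts at 0.
    a = [rng.gauss(0.0, 1.0) / math.sqrt(cols) for _ in range(cols)]
    b = [0.0] * rows
    for _ in range(kappa):
        # Residual of the adapted layer: R = W + b a^T - target.
        R = [[W[i][j] + b[i] * a[j] - target[i][j] for j in range(cols)]
             for i in range(rows)]
        grad_b = [sum(R[i][j] * a[j] for j in range(cols)) for i in range(rows)]
        grad_a = [sum(R[i][j] * b[i] for i in range(rows)) for j in range(cols)]
        b = [b[i] - lr * grad_b[i] for i in range(rows)]
        a = [a[j] - lr * grad_a[j] for j in range(cols)]
    return b, a


def merge(W, b, a):
    """Fold the trained booster into the base weights in place."""
    for i in range(len(W)):
        for j in range(len(W[0])):
            W[i][j] += b[i] * a[j]


def xgblora_toy(layers, targets, T, kappa, layers_per_booster=2, lr=0.05, seed=0):
    """Run T boosting rounds; return the total loss before and after each round."""
    rng = random.Random(seed)
    history = [sum(layer_loss(W, t) for W, t in zip(layers, targets))]
    for _ in range(T):
        # Each booster adapts a random subset of layers (L_s in Figure 3).
        for idx in rng.sample(range(len(layers)), layers_per_booster):
            b, a = train_booster(layers[idx], targets[idx], kappa, lr, rng)
            merge(layers[idx], b, a)
        history.append(sum(layer_loss(W, t) for W, t in zip(layers, targets)))
    return history


# Synthetic data (assumption): 4 "pre-trained" 4x4 layers and shifted targets.
data_rng = random.Random(1)
layers = [[[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(4)]
          for _ in range(4)]
targets = [[[w + data_rng.gauss(0.0, 1.0) for w in row] for row in W] for W in layers]

history = xgblora_toy(layers, targets, T=30, kappa=5)
print(f"loss before: {history[0]:.3f}, after {len(history) - 1} boosters: {history[-1]:.3f}")
```

Only one rank-1 pair per selected layer is held in memory at any time, since each booster is merged before the next one is created; the accumulated update can still reach a higher rank across rounds.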

Theorems & Definitions (17)

  • Lemma 1: XGBLoRA Gradient Approximation.
  • Lemma 2: Accumulated Update Bound.
  • Lemma 3: Gradient Lipschitz Continuity.
  • Theorem 1: XGBLoRA Convergence.
  • Remark 1
  • Theorem 2: XGBLoRA Expressiveness Error.
  • Remark 2
  • Lemma 4: XGBLoRA Gradient Approximation.
  • Proof 1
  • Lemma 5: Accumulated Update Bound.
  • ...and 7 more