Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning

Wenhan Xia, Chengwei Qin, Elad Hazan

TL;DR

This work introduces Chain of LoRA (COLA), an iterative residual-learning framework that extends low-rank adaptation (LoRA) by chaining multiple LoRA updates to better approximate the optimal task-specific weight updates without increasing compute or memory requirements. Grounded in a Frank-Wolfe-inspired optimization perspective, COLA alternates between tuning LoRA modules, merging them into the frozen base, and extending the chain with new modules to learn residuals. The authors provide convergence analysis for a stochastic nonconvex setting and demonstrate consistent, substantial improvements over LoRA across OPT-1.3B and Llama-2-7B on seven tasks, with favorable ablations on chain length and rank-decay strategy. The approach preserves inference latency and shows promise for scalable, efficient fine-tuning of large language models in practical settings.

Abstract

Fine-tuning is the primary methodology for tailoring pre-trained large language models to specific tasks. As the model's scale and the diversity of tasks expand, parameter-efficient fine-tuning methods are of paramount importance. One of the most widely used families of methods is low-rank adaptation (LoRA) and its variants. LoRA encodes the weight update as the product of two low-rank matrices. Despite its advantages, LoRA falls short of full-parameter fine-tuning in terms of generalization error for certain tasks. We introduce Chain of LoRA (COLA), an iterative optimization framework inspired by the Frank-Wolfe algorithm, to bridge the gap between LoRA and full-parameter fine-tuning, without incurring additional computational costs or memory overheads. COLA employs a residual learning procedure in which it merges learned LoRA modules into the pre-trained language model parameters and re-initializes optimization for newly created LoRA modules. We provide theoretical convergence guarantees as well as empirical results to validate the effectiveness of our algorithm. Across various models (OPT and Llama-2) and seven benchmarking tasks, we demonstrate that COLA can consistently outperform LoRA without additional computational or memory costs.
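
To make the low-rank update and the merge step concrete, here is a minimal NumPy sketch. It is an illustration under assumptions, not the authors' implementation: the layer sizes, the rank `r`, the random stand-ins for trained factors, and the helper names `lora_forward` and `merge` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4                 # illustrative sizes, not from the paper

W = rng.normal(size=(d_out, d_in))         # frozen pre-trained weight
B = rng.normal(size=(d_out, r))            # random stand-in for a trained LoRA factor
A = rng.normal(size=(r, d_in))             # random stand-in for a trained LoRA factor

def lora_forward(x, W, B, A):
    # Base path plus low-rank path; W itself is never modified.
    return x @ W.T + (x @ A.T) @ B.T

def merge(W, B, A):
    # Fold the low-rank update B @ A into one dense matrix of the same shape as W.
    return W + B @ A

x = rng.normal(size=(3, d_in))             # a small batch of inputs
W_merged = merge(W, B, A)

# The merged layer computes the same function as the unmerged one.
same_output = np.allclose(lora_forward(x, W, B, A), x @ W_merged.T)

# The update has rank at most r, and far fewer trainable parameters than W.
update_rank = np.linalg.matrix_rank(B @ A)
trainable, full = r * (d_in + d_out), d_in * d_out
```

Because the merged matrix has the same shape as `W`, merging adds no parameters at inference time, which is consistent with the claim above that the approach does not add memory overhead.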

Paper Structure (19 sections, 1 theorem, 11 equations, 3 figures, 4 tables, 2 algorithms)

Key Result

Theorem 4.1

The conditional gradient algorithm (alg:condgrad) applied to a sequence of stochastic gradients of $\beta$-smooth nonconvex functions that are bounded in $\mathcal{K}$ by $M$, with step sizes $\eta_t = \frac{\sqrt{M}}{D \sqrt{\beta T}}$, attains the following convergence guarantee

Figures (3)

  • Figure 1: An illustration of Chain of LoRA. Our approach starts with a frozen LLM and learns a sequence of low-rank matrices to approximate a high-rank augmentation to perform task adaptation. As shown in the dashed-line box, each residual learning procedure consists of three steps: (1) LoRA Tuning, (2) Tie a knot, and (3) Extend the chain. In step 1, low-rank LoRA modules are fine-tuned. In step 2, the learned LoRA weights are merged into the frozen model. In step 3, a new LoRA module is instantiated and the optimizer state is reset. These three steps are repeated in this residual learning paradigm.
  • Figure 2: Ablation study on test accuracy with different COLA chain lengths across tasks. Results are averaged over five different seeds, and the shaded area corresponds to the standard deviation. The general trend is that test accuracy increases with chain length.
  • Figure 3: COLA with rank step-down for three tasks. Experiments are conducted with COLA of length 2, where ($A_1$, $B_1$) has a fixed rank of 8 and ($A_2$, $B_2$) has the rank shown in the figure.
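
The three-step loop in the Figure 1 caption, together with the per-link ranks varied in Figure 3, can be sketched on a toy problem. This is a hedged illustration only. It assumes a quadratic loss $\frac{1}{2}\|W - W_{\text{target}}\|_F^2$ in place of a language-modeling loss and plain gradient descent in place of the paper's optimizer. The function `cola_toy` and all sizes, step counts, and learning rates are hypothetical.

```python
import numpy as np

def cola_toy(W0, W_target, ranks, steps=500, lr=0.02, seed=0):
    """Toy chain-of-LoRA loop on the loss 0.5 * ||W - W_target||_F^2.

    `ranks` holds one rank per chain link, so len(ranks) is the chain length.
    """
    rng = np.random.default_rng(seed)
    W = W0.copy()                                    # base weights, fixed within a link
    loss = lambda M: 0.5 * np.sum((M - W_target) ** 2)
    losses = [loss(W)]
    for r in ranks:
        # Extend the chain: fresh module with B = 0, so the model is unchanged at start.
        # Plain gradient descent keeps no optimizer state, so there is nothing to reset.
        B = np.zeros((W.shape[0], r))
        A = rng.normal(scale=1.0 / np.sqrt(W.shape[1]), size=(r, W.shape[1]))
        # LoRA tuning: update only A and B.
        for _ in range(steps):
            G = W + B @ A - W_target                 # gradient w.r.t. the effective weight
            B, A = B - lr * G @ A.T, A - lr * B.T @ G
        # Tie a knot: merge the learned low-rank update into the base weights.
        W = W + B @ A
        losses.append(loss(W))
    return W, losses

rng = np.random.default_rng(1)
W0 = np.zeros((6, 4))
W_target = 0.5 * rng.normal(size=(6, 4))             # generically full rank (rank 4)

_, single = cola_toy(W0, W_target, ranks=[2])        # ordinary rank-2 LoRA
_, chain = cola_toy(W0, W_target, ranks=[2, 2])      # chain of two rank-2 links
```

In this toy setting a single rank-2 module cannot represent the rank-4 target, while the second link fits part of the residual left by the first, so `chain[-1]` ends below `single[-1]`. Passing a list such as `ranks=[2, 1]` mimics the rank step-down setup, with the caveat that nothing here reproduces the paper's experimental results.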

Theorems & Definitions (2)

  • Theorem 4.1
  • proof