AdaFish: Fast low-rank parameter-efficient fine-tuning by using second-order information

Jiang Hu; Quanzheng Li

TL;DR

This work introduces AdaFish, an efficient second-order algorithm designed to speed up training in low-rank decomposition-based fine-tuning frameworks, and proves its global convergence along with its iteration/oracle complexity.

Abstract

Recent advancements in large-scale pretrained models have significantly improved performance across a variety of tasks in natural language processing and computer vision. However, the large number of parameters in these models means that full training requires substantial memory and computational resources. To adapt these models to downstream tasks or application-specific datasets, parameter-efficient fine-tuning methods that leverage the pretrained parameters have gained considerable attention. Even so, fine-tuning can still be time-consuming because of the number of parameters and training epochs involved. In this work, we introduce AdaFish, an efficient second-order algorithm designed to speed up training in low-rank decomposition-based fine-tuning frameworks. Our key observation is that the associated generalized Fisher information matrix is either low-rank or extremely small in scale. This generalized Fisher information matrix is shown to be equivalent to the Hessian matrix. Moreover, we prove the global convergence of AdaFish, along with its iteration/oracle complexity. Numerical experiments show that our algorithm is quite competitive with the state-of-the-art AdamW method.
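
The observation about a small-scale matrix can be illustrated with a toy sketch. This is not the AdaFish algorithm itself: it assumes an r x n gradient G of one low-rank factor and uses the damped outer product G G^T / n as a stand-in for the generalized Fisher information matrix. The function names, the damping value, and the random data are all hypothetical choices made for illustration.

```python
import random


def fisher_like(G, damping):
    """Return the r x r matrix G G^T / n + damping * I for an r x n gradient G.

    Assumption: this outer-product form is only a stand-in for the paper's
    generalized Fisher information matrix.
    """
    r, n = len(G), len(G[0])
    return [
        [
            sum(G[i][k] * G[j][k] for k in range(n)) / n
            + (damping if i == j else 0.0)
            for j in range(r)
        ]
        for i in range(r)
    ]


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    r = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(r)]  # augmented matrix [A | B]
    for c in range(r):
        # Move the row with the largest pivot candidate into position c.
        p = max(range(c, r), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        pivot = M[c][c]
        M[c] = [v / pivot for v in M[c]]
        # Eliminate column c from every other row.
        for i in range(r):
            if i != c:
                f = M[i][c]
                M[i] = [vi - f * vc for vi, vc in zip(M[i], M[c])]
    return [row[r:] for row in M]


def preconditioned_direction(G, damping=1e-3):
    """Natural-gradient-style direction F^{-1} G; only an r x r system is solved."""
    return solve(fisher_like(G, damping), G)


random.seed(0)
r, n = 4, 64  # rank r is much smaller than the layer width n
G = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(r)]
F = fisher_like(G, 1e-3)
D = preconditioned_direction(G)
print(len(F), len(F[0]))  # size of the matrix that is inverted
print(len(D), len(D[0]))  # size of the resulting update direction
```

The point of the sketch is the cost profile: the linear system has size r x r regardless of n, so a second-order style preconditioner stays cheap when the rank is small.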

Paper Structure (25 sections, 2 theorems, 42 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Contributions
Notations
Preliminaries
LoRA
Parameter-efficient fine-tuning based on tensor low-rank decompositions
Fisher information matrix based stochastic optimization methods
An efficient second-order algorithm for fast training
Generalized Fisher information matrix
Connection with Hessian matrix
Natural gradient direction
Our algorithm: AdaFish
Extension to low-rank tensor decomposition based fine-tuning
Convergence analysis
Numerical results
...and 10 more sections

Key Result

Lemma 1

Assume that the columns of the sample gradient $g = g(\theta;x,y) \in \mathbb{R}^{r \times n}$ are independent and identically distributed random vectors with zero mean under the distribution $p(y \mid x, \theta)$ for any $\theta$. We have: In addition, it holds:

Figures (3)

Figure 1: Training losses and testing accuracies on Cifar, Caltech, Oxford_flowers, and DTD from Vtab-1k. AdaFish shows over 2x faster convergence in training losses and superior test accuracy, indicating enhanced generalization performance.
Figure 2: Training losses and testing accuracies on Resisc45 from Vtab-1k. AdaFish shows over 2x faster convergence in training losses and superior test accuracy, indicating enhanced generalization performance.
Figure 3: Testing ROUGE scores obtained by AdamW and AdaFish.

Theorems & Definitions (3)

Lemma 1
Theorem 2
Remark 3
