AdaFish: Fast low-rank parameter-efficient fine-tuning by using second-order information

Jiang Hu, Quanzheng Li

TL;DR

This work introduces AdaFish, an efficient second-order algorithm designed to speed up training within low-rank decomposition-based fine-tuning frameworks, and proves its global convergence along with its iteration/oracle complexity.

Abstract

Recent advancements in large-scale pretrained models have significantly improved performance across a variety of tasks in natural language processing and computer vision. However, the extensive number of parameters in these models necessitates substantial memory and computational resources for full training. To adapt these models for downstream tasks or specific application-oriented datasets, parameter-efficient fine-tuning methods leveraging pretrained parameters have gained considerable attention. However, such fine-tuning can still be time-consuming because of the large number of parameters and training epochs. In this work, we introduce AdaFish, an efficient second-order algorithm designed to expedite the training process within low-rank decomposition-based fine-tuning frameworks. Our key observation is that the associated generalized Fisher information matrix is either low-rank or extremely small in scale. Such a generalized Fisher information matrix is shown to be equivalent to the Hessian matrix. Moreover, we prove the global convergence of AdaFish, along with its iteration/oracle complexity. Numerical experiments show that our algorithm is quite competitive with the state-of-the-art AdamW method.

Paper Structure

This paper contains 25 sections, 2 theorems, 42 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

Assume that the columns of the sample gradient $g = g(\theta;x,y) \in \mathbb{R}^{r \times n}$ are independent and identically distributed random vectors with zero mean under the distribution $p(y \mid x, \theta)$ for any $\theta$. We have: In addition, it holds:
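
The lemma's displayed equations are not reproduced on this page, so the following is only a minimal illustrative sketch of why a column-wise second-moment matrix of an $r \times n$ gradient is cheap to work with. It is not the AdaFish update and not the statement of Lemma 1. The function names, the damping term, and the random test gradient are all hypothetical choices made for illustration.

```python
import numpy as np


def column_second_moment(g):
    """Empirical second moment (1/n) * g @ g.T of the n columns of an r x n gradient."""
    n = g.shape[1]
    return (g @ g.T) / n


def preconditioned_direction(g, damping=1e-3):
    """Solve (F + damping * I) d = g using the small r x r matrix F.

    The damping value is an assumption for numerical stability, not taken from the paper.
    """
    r = g.shape[0]
    F = column_second_moment(g)
    return np.linalg.solve(F + damping * np.eye(r), g)


# Hypothetical stand-in for a sample gradient with zero-mean i.i.d. columns.
rng = np.random.default_rng(0)
r, n = 4, 256
g = rng.standard_normal((r, n))

F = column_second_moment(g)          # r x r matrix
d = preconditioned_direction(g)      # same shape as g

# Number of entries in a dense curvature matrix over all r*n parameters,
# versus the r x r matrix built from the columns.
full_size = (r * n) ** 2
small_size = r * r
```

With $r = 4$ and $n = 256$, the dense matrix over all $rn$ entries would hold $(rn)^2$ numbers, while the column-wise matrix holds only $r^2$, which is why a linear solve against it costs little when the rank $r$ is small.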

Figures (3)

  • Figure 1: Training losses and test accuracies on Cifar, Caltech, Oxford_flowers, and DTD from VTAB-1k. AdaFish shows over 2x faster convergence in training loss and superior test accuracy, indicating enhanced generalization performance.
  • Figure 2: Training losses and test accuracies on Resisc45 from VTAB-1k. AdaFish shows over 2x faster convergence in training loss and superior test accuracy, indicating enhanced generalization performance.
  • Figure 3: Test ROUGE scores obtained by AdamW and AdaFish.

Theorems & Definitions (3)

  • Lemma 1
  • Theorem 2
  • Remark 3