See Further for Parameter Efficient Fine-tuning by Standing on the Shoulders of Decomposition

Chongjie Si; Xiaokang Yang; Wei Shen

TL;DR

This work addresses the lack of a unified theoretical basis for parameter-efficient fine-tuning (PEFT) by introducing Subspace Tuning, a decomposition-based framework that treats PEFT as subspace manipulation of frozen weights. It places reconstruction, extension, and their combination under a common formalism and derives mathematical principles that explain performance differences among methods. Building on this theory, the authors propose two novel PEFT approaches and a practical framework, Matrix Pattern Constraints (MPC), that boosts performance without extra parameters. The methods approach full fine-tuning performance with very small parameter budgets (e.g., 0.02%–1% of parameters) across three large pretrained models. The results offer both theoretical insight and practical improvements, and suggest how to design more expressive and stable PEFT methods for diverse tasks and resource-constrained settings.

Abstract

The rapid expansion of large foundation models within the pre-training and fine-tuning framework has underscored that larger models often yield better results. However, scaling up these models has led to soaring costs in fine-tuning and parameter storage, rendering extensive adaptation impractical. This challenge has sparked the development of parameter-efficient fine-tuning (PEFT), which optimizes a select subset of parameters while keeping the rest fixed, significantly lowering computational and storage overheads. Although PEFT has seen significant success in recent years, the fundamental principles behind these methods remain poorly understood. To this end, we take the first step to unify all approaches by dissecting them from a decomposition perspective. We initiate a comprehensive mathematical analysis of these methods, which lets us examine their underlying mechanisms and explore the reasons behind the variations in performance among different techniques. Furthermore, inspired by our theoretical analysis, we introduce two novel PEFT methods alongside a simple yet effective framework designed to enhance the performance of PEFT techniques across various applications. Our empirical validations, conducted across multiple datasets, demonstrate the efficacy of these methods, showing both theoretical validity and practical performance improvements under the guidance of our analytical findings. We believe our work will deepen researchers' understanding of PEFT and other techniques, prompting further reflection and advancing research across the whole community.

Paper Structure (13 sections, 2 theorems, 40 equations, 5 figures, 6 tables)

This paper contains 13 sections, 2 theorems, 40 equations, 5 figures, 6 tables.

Introduction
Subspace Tuning
Subspace Reconstruction
Singular Value Adjustment
Singular Value Adjustment
Scaling
Nonlinear Mapping
Subspace Extension
LoRA and its Derivatives
Matrix Pattern Constraints (MPC)
Adapter Derivatives
Subspace Combination
Conclusion and Discussion

Key Result

Proposition 1

The expressiveness and optimization landscape of extension-based methods are directly influenced by the decomposition structure of $\Delta \mathbf{W}$, rather than by its mathematical equivalence to other forms.
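One way to build intuition for a statement of this kind is a toy numerical check. The sketch below is not the authors' proof. It takes a low-rank update $\Delta \mathbf{W} = \mathbf{B}\mathbf{A}$ and a rescaled factorization $(c\mathbf{B}, \mathbf{A}/c)$ that represents exactly the same matrix, then compares the change in $\Delta \mathbf{W}$ induced by one plain gradient step on the factors. The matrix sizes, the scale `c`, the random stand-in gradient `G`, and the helper `induced_update` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 6, 5, 2

B = rng.standard_normal((n, r))
A = rng.standard_normal((r, m))
G = rng.standard_normal((n, m))  # stand-in for dLoss/d(DeltaW); hypothetical


def induced_update(B, A, G, lr=1e-3):
    """Take one gradient step on the factors and return the change in B @ A."""
    grad_B = G @ A.T   # chain rule: dLoss/dB
    grad_A = B.T @ G   # chain rule: dLoss/dA
    B_new = B - lr * grad_B
    A_new = A - lr * grad_A
    return B_new @ A_new - B @ A


# A second factorization of the very same DeltaW (hypothetical rescaling).
c = 3.0
B2, A2 = c * B, A / c

same_delta = np.allclose(B @ A, B2 @ A2)
same_step = np.allclose(induced_update(B, A, G), induced_update(B2, A2, G))

print("same DeltaW:", same_delta)
print("same update after one step:", same_step)
```

Both factor pairs encode the same $\Delta \mathbf{W}$, yet the first-order change is $-\eta(\mathbf{G}\mathbf{A}^{\top}\mathbf{A} + \mathbf{B}\mathbf{B}^{\top}\mathbf{G})$ for the first pair and $-\eta(\mathbf{G}\mathbf{A}^{\top}\mathbf{A}/c^{2} + c^{2}\mathbf{B}\mathbf{B}^{\top}\mathbf{G})$ for the second, so the optimization trajectory depends on how $\Delta \mathbf{W}$ is decomposed.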

Figures (5)

Figure 1: Framework of subspace tuning. a, Subspace tuning endeavors to identify the maximal projection of the optimal weight $\mathbf{W}^{*}$ onto the subspace spanned by the bases of $\phi(\mathbf{W})$. Here, $\phi(\mathbf{W})$ denotes the subspace transformation of the original frozen weight $\mathbf{W}$. b, Subspace reconstruction involves rescaling the subspace of $\mathbf{W}$ to approximate $\mathbf{W}^{*}$, or to construct a new subspace derived from the original. Subspace extension seeks to adjust the subspace of the original weight $\mathbf{W}$ such that it approaches or even encompasses $\mathbf{W}^{*}$. Subspace combination encompasses both the reconstruction and extension of subspaces. c, A numerical perspective on subspace tuning. Reconstruction involves modifying the frozen parameters, while extension entails adding new tunable parameters.
Figure 2: a, Subspace view of reconstruction-based methods. Fine-tuning the singular values involves rescaling the weights, while fine-tuning the singular vectors effectively reconstructs the subspace. b, Numerical view of reconstruction-based methods. Adjustments in the subspace are mapped directly to their numerical counterparts. c, The performance of reconstruction-based methods. With less than 0.1% of the parameters of the pretrained model, SSL and SSB can achieve up to 99% of the performance of full fine-tuning. The horizontal dashed line, labeled FT, represents the performance of full fine-tuning. The average scores of each method are evaluated with three large pretrained models, RoBERTa-base [liu2019roberta], DeBERTaV3-base [he2021debertav3], and RoBERTa-large [liu2019roberta], on the GLUE benchmark. Error bars represent the standard error of the mean across five runs.
Figure 3: Subspace and numerical views of extension-based methods. Extension-based methods introduce an additional weight matrix and then try to find the optimal weight projection within the subspace spanned by this additional weight and the original weight. To achieve this, the basis of the subspace constructed by the additional matrix should complement the basis of the original weights as much as possible. The right panel lists some common extension-based methods and their operations on matrices.
Figure 4: Average score of extension- and combination-based methods. The horizontal dashed line, labeled FT, represents the performance of full fine-tuning. In general, FLoRA outperforms AdaLoRA and TriLoRA, which in turn outperform LoRA, and DoRA outperforms LoRA. The average scores of PEFT methods are evaluated with three large pretrained models, RoBERTa-base [liu2019roberta], DeBERTaV3-base [he2021debertav3], and RoBERTa-large [liu2019roberta], on the GLUE benchmark. Each method is assessed under four different ranks: 2, 4, 8, and 16. Error bars represent the standard error of the mean across five runs.
Figure 5: Average score of different methods coupled with the MPC framework. a, The performance of LoRA when coupled with MPC$_o$, MPC$_d$, and MPC$_n$. b-c, The performance of TriLoRA and AdaLoRA when coupled with MPC$_o$, respectively. The MPC framework significantly enhances the performance of various PEFT methods, as evaluated with three large pretrained models, RoBERTa-base [liu2019roberta], DeBERTaV3-base [he2021debertav3], and RoBERTa-large [liu2019roberta], on the GLUE benchmark. Each method is assessed under four different ranks: 2, 4, 8, and 16. Error bars represent the standard error of the mean across five runs.
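
The reconstruction and extension views described in the captions of Figures 1 to 3 can be illustrated on toy matrices. The sketch below is a minimal example under stated assumptions, not the authors' implementation. It assumes a known target `W_star` (in practice $\mathbf{W}^{*}$ is unknown and the tunable parameters are learned by gradient descent), uses singular value adjustment as the reconstruction example, and uses a rank-`r` additive update as the extension example. All sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 8, 6, 2

W = rng.standard_normal((n, m))                  # frozen pretrained weight (toy)
W_star = W + 0.3 * rng.standard_normal((n, m))   # hypothetical optimal weight

# Reconstruction: keep the singular vectors of W fixed and retune only the
# singular values. The rank-one matrices u_i v_i^T are orthonormal under the
# Frobenius inner product, so the least-squares optimum is s_i = u_i^T W* v_i.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
s_opt = np.einsum("ni,nm,im->i", U, W_star, Vt)
W_recon = (U * s_opt) @ Vt

# Extension: add a new rank-r matrix B @ A to the frozen W. The best rank-r
# fit of the residual W* - W is its truncated SVD (Eckart-Young theorem).
Ur, sr, Vtr = np.linalg.svd(W_star - W, full_matrices=False)
B = Ur[:, :r] * sr[:r]
A = Vtr[:r]
W_ext = W + B @ A

err_frozen = np.linalg.norm(W_star - W)
err_recon = np.linalg.norm(W_star - W_recon)
err_ext = np.linalg.norm(W_star - W_ext)

print("tunable scalars (reconstruction):", s_opt.size)
print("tunable scalars (extension):", B.size + A.size)
print("errors (frozen, reconstruction, extension):",
      err_frozen, err_recon, err_ext)
```

In this toy setting, reconstruction tunes only `min(n, m)` scalars while extension tunes `r * (n + m)`. Neither can do worse than leaving `W` frozen, because the frozen weight is itself a feasible point of both parameterizations.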

Theorems & Definitions (2)

Proposition 1
Proposition 2
