Increasing Model Capacity for Free: A Simple Strategy for Parameter Efficient Fine-tuning

Haobo Song, Hao Zhao, Soumajit Majumder, Tao Lin

TL;DR

CapaBoost introduces a simple, plug-in strategy to increase the effective capacity of parameter-efficient fine-tuning by using multiple parallel, weight-tied updates with deterministic random masks. Theoretical and empirical results show that the approach expands the effective rank of incremental updates, enabling higher performance without increasing trainable parameters or FLOPs. Across NLP and vision benchmarks (GLUE, SQuAD, VTAB), CapaBoost variants (notably LoRA and PAdapter) consistently outperform strong PEFT baselines while maintaining or reducing budgeted parameters, and demonstrate hardware-friendly sparse computation. The work suggests a practical path to scaling PEFT performance for large models and offers insights into rank-driven capacity gains and mask design for future exploration.

Abstract

Fine-tuning large pre-trained foundation models, such as the 175B GPT-3, has attracted more attention for downstream tasks recently. While parameter-efficient fine-tuning methods have been proposed and proven effective without retraining all model parameters, their performance is limited by the capacity of incremental modules, especially under constrained parameter budgets. To overcome this challenge, we propose CapaBoost, a simple yet effective strategy that enhances model capacity by leveraging low-rank updates through parallel weight modules in target layers. By applying static random masks to the shared weight matrix, CapaBoost constructs a diverse set of weight matrices, effectively increasing the rank of incremental weights without adding parameters. Notably, our approach can be seamlessly integrated into various existing parameter-efficient fine-tuning methods. We extensively validate the efficacy of CapaBoost through experiments on diverse downstream tasks, including natural language understanding, question answering, and image classification. Our results demonstrate significant improvements over baselines, without incurring additional computation or storage costs. Our code is available at https://github.com/LINs-lab/CapaBoost.
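
The rank argument in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a LoRA-style update $\mathbf{B}\mathbf{A}$ with inner dimension `r`, and it assumes each parallel module masks both shared factors with its own fixed random mask. All names (`rank`, `random_mask`, `n_parallel`, the matrix sizes, the density of 0.5) are hypothetical choices for illustration. The plain product has rank at most `r`, while the sum of `n_parallel` masked products can reach rank up to `n_parallel * r` with the same stored weights.

```python
import random
from fractions import Fraction

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def apply_mask(w, m):
    """Element-wise (Hadamard) product of a weight matrix and a 0/1 mask."""
    return [[x * k for x, k in zip(rw, rm)] for rw, rm in zip(w, m)]

def rank(mat):
    """Exact rank via Gaussian elimination over rationals."""
    rows = [[Fraction(x) for x in r] for r in mat]
    rk = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(rk, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            f = rows[i][c] / rows[rk][c]
            if f != 0:
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk

def random_matrix(rng, n_rows, n_cols):
    return [[rng.randint(-9, 9) for _ in range(n_cols)] for _ in range(n_rows)]

def random_mask(rng, n_rows, n_cols, density):
    return [[1 if rng.random() < density else 0 for _ in range(n_cols)]
            for _ in range(n_rows)]

# Hypothetical sizes: a 12x12 layer, inner dimension 2, two parallel modules.
rng = random.Random(0)
d_out, d_in, r, n_parallel = 12, 12, 2, 2

B = random_matrix(rng, d_out, r)   # shared (weight-tied) factors
A = random_matrix(rng, r, d_in)

plain = matmul(B, A)               # ordinary low-rank update, rank <= r

# Sum of parallel modules; each one sees the same B and A through its own masks.
boosted = [[0] * d_in for _ in range(d_out)]
for _ in range(n_parallel):
    m_b = random_mask(rng, d_out, r, density=0.5)
    m_a = random_mask(rng, r, d_in, density=0.5)
    boosted = add(boosted, matmul(apply_mask(B, m_b), apply_mask(A, m_a)))

print("rank of plain update:  ", rank(plain), "(bound:", r, ")")
print("rank of boosted update:", rank(boosted), "(bound:", n_parallel * r, ")")
```

The upper bound follows from subadditivity of rank: a sum of `n_parallel` matrices of rank at most `r` has rank at most `n_parallel * r`. Whether the bound is actually reached depends on the masks drawn, which is why the sketch prints the ranks rather than asserting a specific value.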

Paper Structure (54 sections, 4 theorems, 12 equations, 4 figures, 17 tables)

This paper contains 54 sections, 4 theorems, 12 equations, 4 figures, 17 tables.

Key Result

Theorem 3.1

Assume two matrices $\mathbf{X}$ and $\mathbf{Y}$ are randomly generated by $\mathbf{X} = \mathbf{X}^{\textrm{col}} \mathbf{X}^{\textrm{row}}$ and $\mathbf{Y} = \mathbf{Y}^{\textrm{col}} \mathbf{Y}^{\textrm{row}}$ respectively. $\mathbf{X}^{\textrm{col}} := [\mathbf{x}^{\textrm{col}}_1, \mathbf{x}^{

Figures (4)

  • Figure 1: Rank and performance comparison among several PEFT methods in the GLUE test with the RoBERTa-base model. The left panel plots performance against parameter count (as a percentage of full fine-tuning) for each method; CapaBoost performs best. The right panel plots rank against parameter count; CapaBoost attains the highest rank among similar PEFT methods.
  • Figure 2: The framework of CapaBoost. (a): Diagram of CapaBoost learning with $\textbf{d=2}$. Applying the blue and red pruning masks to the original $\mathbf{w}$ in (1) yields $\mathbf{w} \odot \mathbf{m}_{\text{blue}}$ in (2) and $\mathbf{w} \odot \mathbf{m}_{\text{red}}$ in (3), both of which share the dense weight in (1). Because (2) and (3) have pruned weights in common, those weights can be excluded from the original $\mathbf{w}$ and only the sparse weights in (4) stored, which reduces the parameter count. During training, weights are retrieved from (4) and the respective masks $\{ \mathbf{m}_i \}$ are applied to obtain the weights in (2) and (3). (b): Diagram of CapaBoost applied to LoRA and Adapter.
  • Figure 3: Ablation study for components of CapaBoost-LoRA on the RoBERTa-base model. (a) Average performance of CapaBoost-LoRA with different pruning masks, the same pruning mask, and Dropout only (no pruning) across sparsity levels on the CoLA dataset. We use two parallel tied modules with a preset LoRA inner dimension of $8$. (b) Average performance of CapaBoost-LoRA under different rank values and numbers of parallel tied modules at density$=0.5$ on the CoLA dataset. Results are averaged over three trials.
  • Figure 4: Ablation study for components of CapaBoost-LoRA on the RoBERTa-base model. (a) Average performance of CapaBoost-LoRA with different masks, the same mask, and Dropout across sparsity levels on the SST-2-10k dataset. We use two tied layers with a preset LoRA rank value of $8$. (b) Average performance of CapaBoost-LoRA under different rank values and numbers of tied layers at density$=0.5$ on the SST-2-10k dataset. Results are averaged over three trials.
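
The storage scheme described in the Figure 2 caption can be sketched as follows. This is a minimal illustration rather than the released code: it assumes each mask is regenerated from a fixed integer seed (so only the seed needs to be kept), and the helper names `make_mask`, `compress`, and `masked_view` are hypothetical. Entries pruned by every mask are never used, so only the entries kept by at least one mask are stored.

```python
import random

def make_mask(seed, n_rows, n_cols, density):
    """Static random mask: the same seed always reproduces the same mask."""
    rng = random.Random(seed)
    return [[rng.random() < density for _ in range(n_cols)] for _ in range(n_rows)]

def compress(w, masks):
    """Store only the entries that at least one mask keeps (item (4) in Figure 2)."""
    return {(i, j): w[i][j]
            for i in range(len(w))
            for j in range(len(w[0]))
            if any(m[i][j] for m in masks)}

def masked_view(store, mask):
    """Rebuild w ⊙ m for one module from the sparse store."""
    return [[store[(i, j)] if keep else 0.0 for j, keep in enumerate(row)]
            for i, row in enumerate(mask)]

# Hypothetical sizes and seeds for illustration.
n_rows, n_cols, density = 8, 8, 0.5
weight_rng = random.Random(123)
w = [[weight_rng.uniform(-1.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]

mask_seeds = [1, 2]  # one seed per parallel module (d = 2)
masks = [make_mask(s, n_rows, n_cols, density) for s in mask_seeds]

store = compress(w, masks)
views = [masked_view(store, m) for m in masks]

print("dense entries: ", n_rows * n_cols)
print("stored entries:", len(store))
```

With two independent masks at density 0.5, roughly a quarter of the entries are expected to be pruned by both masks, so the store is typically smaller than the dense matrix; the exact count depends on the seeds.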

Theorems & Definitions (8)

  • Theorem 3.1
  • Remark 3.2
  • Remark 3.3: The consistency between intuition and practice
  • Theorem A.1
  • Lemma A.2: marsaglia1964bounds
  • Lemma A.3: bogachev2007measure
  • proof
  • Remark A.4