SPAFIT: Stratified Progressive Adaptation Fine-tuning for Pre-trained Large Language Models

Samir Arora, Liangliang Wang

TL;DR

This work tackles the high cost and potential forgetting of full fine-tuning by proposing SPAFIT, a stratified progressive fine-tuning method that divides the model's layers into three groups of increasing tuning complexity. Group 1 is frozen, Group 2 updates only bias terms, and Group 3 applies LoRA to selected attention components and bias updates elsewhere, so that only about 1–2% of parameters are trained. Evaluated on nine GLUE tasks with BERT-large-cased, SPAFIT variants perform comparably to or better than full fine-tuning and LoRA while tuning far fewer parameters, though CoLA and SST-2 remain challenging for PEFT methods. The approach makes adaptation of large language models more practical and memory-efficient, and motivates further study of layer-wise localization of linguistic knowledge in encoder–decoder architectures and on more diverse tasks.
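
The three-group split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `SpafitConfig`, `group1_end`, and `group2_end` are hypothetical, and the boundaries used in the usage lines (4 and 9) are arbitrary illustrative values rather than settings reported in the paper. The only model fact assumed is that BERT-large has 24 encoder layers. The helper `lora_extra_params` uses the standard LoRA parameter count for one weight matrix, two low-rank factors of rank `rank`.

```python
from __future__ import annotations

from dataclasses import dataclass

FROZEN = "frozen"                  # Group 1: no parameters updated
BIAS_ONLY = "bias_only"            # Group 2: only bias terms updated
LORA_PLUS_BIAS = "lora_plus_bias"  # Group 3: LoRA on selected attention weights, biases elsewhere


@dataclass(frozen=True)
class SpafitConfig:
    """Hypothetical container for the group boundaries (not from the paper)."""
    num_layers: int   # number of encoder layers in the model
    group1_end: int   # layers [0, group1_end) are frozen
    group2_end: int   # layers [group1_end, group2_end) tune biases only


def layer_strategy(layer_idx: int, cfg: SpafitConfig) -> str:
    """Return the tuning strategy assigned to one encoder layer."""
    if not 0 <= layer_idx < cfg.num_layers:
        raise ValueError(f"layer index {layer_idx} out of range")
    if layer_idx < cfg.group1_end:
        return FROZEN
    if layer_idx < cfg.group2_end:
        return BIAS_ONLY
    return LORA_PLUS_BIAS


def spafit_plan(cfg: SpafitConfig) -> list[str]:
    """Return one strategy per layer, ordered from the first layer to the last."""
    if not 0 <= cfg.group1_end <= cfg.group2_end <= cfg.num_layers:
        raise ValueError("boundaries must satisfy 0 <= group1_end <= group2_end <= num_layers")
    return [layer_strategy(i, cfg) for i in range(cfg.num_layers)]


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two factors, A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


# Usage with illustrative boundaries (assumed values, not the paper's).
cfg = SpafitConfig(num_layers=24, group1_end=4, group2_end=9)
plan = spafit_plan(cfg)
for idx, strategy in enumerate(plan):
    print(f"layer {idx:2d}: {strategy}")
```

In a real training setup, the plan would drive which parameters are left trainable in each layer; how the boundaries and LoRA targets are chosen is specified in the paper itself, not in this sketch.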

Abstract

Full fine-tuning is a popular approach to adapting Transformer-based pre-trained large language models to a specific downstream task. However, its substantial computational and storage requirements have discouraged its widespread use. Moreover, increasing evidence of catastrophic forgetting and overparameterization in the Transformer architecture has motivated researchers to seek parameter-efficient fine-tuning (PEFT) methods. Commonly known PEFT methods such as LoRA and BitFit are typically applied across all layers of the model. We propose a PEFT method, Stratified Progressive Adaptation Fine-tuning (SPAFIT), based on the localization of different types of linguistic knowledge to specific layers of the model. Our experiments on nine tasks from the GLUE benchmark show that SPAFIT outperforms other PEFT methods while fine-tuning only a fraction of the parameters those methods adjust.

Paper Structure

This paper contains 16 sections, 5 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Layer-wise breakdown of an encoder layer of a particular implementation of BERT used for experiments.
  • Figure 2: An example implementation of SPAFIT on BERT.