Towards Adaptive Prefix Tuning for Parameter-Efficient Language Model Fine-tuning

Zhen-Ru Zhang, Chuanqi Tan, Haiyang Xu, Chengyu Wang, Jun Huang, Songfang Huang

TL;DR

The paper addresses the high cost of full fine-tuning by revisiting prefix tuning and introducing Adaptive Prefix Tuning (APT), which uses token-level and layer-level gates to adapt prefixes per Transformer layer. The method explicitly accounts for layer-wise differences in feature representations, enabling more efficient and effective task adaptation. Empirical results on SuperGLUE and NER across multiple backbones show that APT consistently outperforms P-Tuning v2, including in few-shot settings, and the analyses reveal meaningful weight distributions aligned with task properties. The work demonstrates that adaptively gated prefixes can yield better parameter-efficient fine-tuning and suggests directions for generalizing adaptive strategies to other architectures.

Abstract

Fine-tuning all parameters of a large pre-trained language model for each downstream task is prohibitively expensive. Parameter-efficient fine-tuning, which optimizes only a few task-specific parameters while keeping the pre-trained model frozen, has therefore attracted attention. In this work, we focus on prefix tuning, which optimizes only continuous prefix vectors (i.e., pseudo tokens) inserted into Transformer layers. Based on the observation that the learned syntactic and semantic representations vary considerably across layers, we argue that an adaptive prefix can be tailored to each layer better than a fixed one, making fine-tuning more effective and efficient. We therefore propose Adaptive Prefix Tuning (APT), which adjusts the prefix at both the fine-grained token level and the coarse-grained layer level with a gate mechanism. Experiments on the SuperGLUE and NER datasets show the effectiveness of APT. In addition, using the gate as a probe, we validate the efficiency and effectiveness of the variable prefix.
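
To make the gate mechanism concrete, the following is a minimal Python sketch of how a token-level gate and a layer-level gate could jointly rescale the prefix of one Transformer layer. The abstract does not state the exact form of the gates, so the sigmoid token gate, the function name `gate_prefix`, and the parameters `hidden`, `gate_weights`, and `layer_scale` are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_prefix(prefix, hidden, gate_weights, layer_scale):
    """Rescale one layer's prefix vectors with two gates (hypothetical sketch).

    prefix:       M prefix vectors, each a list of d floats
    hidden:       d-dim summary of the previous layer's output (assumption)
    gate_weights: d x M matrix giving one logit per prefix token (assumption)
    layer_scale:  one learnable scalar shared by the whole layer (assumption)
    """
    num_tokens = len(prefix)
    # Fine-grained, token-level gate: one weight in (0, 1) per prefix token.
    token_gates = [
        sigmoid(sum(hidden[k] * gate_weights[k][j] for k in range(len(hidden))))
        for j in range(num_tokens)
    ]
    # Coarse-grained, layer-level gate: the same scalar rescales every token.
    gated = [
        [layer_scale * token_gates[j] * value for value in prefix[j]]
        for j in range(num_tokens)
    ]
    return gated, token_gates


# Toy usage with random values standing in for learned parameters.
random.seed(0)
d, M = 4, 3
prefix = [[random.gauss(0, 1) for _ in range(d)] for _ in range(M)]
hidden = [random.gauss(0, 1) for _ in range(d)]
gate_weights = [[random.gauss(0, 1) for _ in range(M)] for _ in range(d)]
gated, token_gates = gate_prefix(prefix, hidden, gate_weights, layer_scale=0.5)
```

In this sketch the token gates depend on the previous layer's output, so each layer can weight its prefix tokens differently, while `layer_scale` controls how strongly the prefix as a whole contributes at that layer. In a real model the gated vectors would be fed into attention as prefix keys and values, which is omitted here.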

Paper Structure (21 sections, 4 equations, 3 figures, 5 tables)

This paper contains 21 sections, 4 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: An illustration of the proposed approach, APT. The left shows the internal structure of a Transformer with inserted prefixes; the right shows the schematic of the prefix gate mechanism.
  • Figure 2: Visualization of the learned weights of the prefix token for SuperGLUE task COPA on BERT-large and NER task CoNLL04 on BERT-base, with darker colors indicating higher weights.
  • Figure 3: The performance of APT and PT-2 on COPA and WSC across a range of prefix lengths on BERT-large.