ADePT: Adaptive Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning

Pengwei Tang, Xiaolin Hu, Yong Liu

TL;DR

The paper introduces Adaptive Decomposed Prompt Tuning (ADePT), a parameter-efficient fine-tuning method that learns adaptive, per-token embedding offsets via a token-shared shallow feed-forward network, while preserving a short soft prompt. By replacing DePT's fixed, position-based offsets with a function f(e) learned for each token, ADePT achieves higher expressive power and better generalization without increasing trainable parameters. Empirical results across 23 NLP tasks and 4 PLMs show that ADePT surpasses leading PT- and DePT-based methods and, in some cases, even outperforms full fine-tuning, with competitive or faster inference. A theoretical analysis demonstrates that ADePT can alter attention patterns in the first transformer layer, yielding stronger adaptation capabilities, and code is provided for reproducibility.

Abstract

Prompt Tuning (PT) enables the adaptation of Pre-trained Large Language Models (PLMs) to downstream tasks by optimizing a small number of soft virtual tokens, which are prepended to the input token embeddings. Recently, Decomposed Prompt Tuning (DePT) has demonstrated superior adaptation capabilities by decomposing the soft prompt into a shorter soft prompt and a pair of low-rank matrices. The product of the two low-rank matrices is added to the input token embeddings to offset them. Additionally, DePT achieves faster inference than PT because of the shorter soft prompt. However, in this paper, we find that the position-based token embedding offsets of DePT restrict its ability to generalize across diverse model inputs, and that sharing the same embedding offsets across many token embeddings leads to sub-optimal results. To tackle these issues, we introduce Adaptive Decomposed Prompt Tuning (ADePT), which is composed of a short soft prompt and a shallow token-shared feed-forward neural network. ADePT uses the token-shared feed-forward neural network to learn the embedding offset for each token, which makes the offsets adapt to the model input and allows them to be optimized more effectively. This enables ADePT to achieve superior adaptation performance without requiring more inference time or additional trainable parameters compared to vanilla PT and its variants. In comprehensive experiments across 23 natural language processing tasks and 4 typical PLMs of different scales, ADePT consistently surpasses the other leading parameter-efficient fine-tuning methods, and even outperforms full fine-tuning in certain scenarios. We also provide a theoretical analysis of ADePT. Code is available at https://github.com/HungerPWAY/ADePT.
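
The difference between position-based and token-based offsets described in the abstract can be made concrete with a small sketch. This is not the authors' implementation (see the repository linked above for that). It assumes the shallow network is a two-layer bottleneck with a ReLU, f(e) = relu(e W1 + b1) W2 + b2, and all sizes, variable names, and the toy embedding table are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
d, s, m, r = 8, 6, 2, 3  # embedding dim, sequence length, short prompt length, rank / bottleneck width

vocab = rng.normal(size=(10, d))          # stand-in for a frozen embedding table
token_ids = np.array([3, 1, 3, 7, 2, 5])  # token 3 appears at positions 0 and 2
E = vocab[token_ids]                      # (s, d) input token embeddings

P_short = rng.normal(size=(m, d))         # short soft prompt (trainable)

# DePT-style offsets: product of two low-rank matrices, one row per position.
A = rng.normal(size=(s, r))
B = rng.normal(size=(r, d))
dept_offsets = A @ B                      # (s, d); row i depends only on position i

# ADePT-style offsets: one small network shared by every token embedding.
# Assumed form (not confirmed by the source): f(e) = relu(e W1 + b1) W2 + b2.
W1, b1 = rng.normal(size=(d, r)), np.zeros(r)
W2, b2 = rng.normal(size=(r, d)), np.zeros(d)

def f(embeddings):
    """Apply the shared bottleneck network row by row."""
    return np.maximum(embeddings @ W1 + b1, 0.0) @ W2 + b2

adept_offsets = f(E)                      # (s, d); row i depends only on the token at i

# Both methods prepend the short soft prompt to the offset embeddings.
dept_input = np.concatenate([P_short, E + dept_offsets], axis=0)
adept_input = np.concatenate([P_short, E + adept_offsets], axis=0)

# Compare the offsets given to the repeated token at positions 0 and 2.
adept_same = np.allclose(adept_offsets[0], adept_offsets[2])
dept_same = np.allclose(dept_offsets[0], dept_offsets[2])
print(adept_same, dept_same)
```

In this sketch the repeated token receives the same offset from the shared network wherever it appears, whereas the low-rank product assigns it whatever offset belongs to each position. That is the sense in which the abstract calls DePT's offsets position-based and ADePT's adaptive.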

Paper Structure

This paper contains 27 sections, 9 equations, 1 figure, and 19 tables.

Figures (1)

  • Figure 1: Overview of Prompt Tuning (PT), Decomposed Prompt Tuning (DePT), and Adaptive Decomposed Prompt Tuning (ADePT). PT prepends a soft prompt to the input token embeddings. DePT uses a short soft prompt and offsets the input token embeddings using a pair of low-rank matrices. ADePT uses a short soft prompt and offsets the input token embeddings using a token-shared shallow feed-forward neural network. ADePT produces embedding offsets that adapt to the input tokens, while DePT can only produce position-based offsets. Moreover, the short soft prompt makes DePT and ADePT faster than PT during inference.
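
The components named in the caption imply different trainable-parameter counts and input lengths, which a few lines of arithmetic can illustrate. The sketch below is an assumption-laden illustration, not the paper's accounting: the function names are hypothetical, the ADePT count assumes a two-layer bottleneck network with biases, and the numeric settings are chosen for illustration rather than taken from the paper.

```python
def pt_params(prompt_len, d):
    """Soft prompt only: prompt_len vectors of dimension d."""
    return prompt_len * d

def dept_params(short_len, d, max_seq_len, rank):
    """Short soft prompt plus low-rank matrices of shape (max_seq_len, rank) and (rank, d)."""
    return short_len * d + max_seq_len * rank + rank * d

def adept_params(short_len, d, bottleneck):
    """Short soft prompt plus an assumed two-layer bottleneck network with biases."""
    down = d * bottleneck + bottleneck   # weights and bias of the first layer
    up = bottleneck * d + d              # weights and bias of the second layer
    return short_len * d + down + up

# Illustrative settings (assumptions, not values reported in the source).
d, max_seq_len = 768, 256
budget = pt_params(100, d)                         # PT with a 100-token prompt
dept_total = dept_params(40, d, max_seq_len, 45)   # shorter prompt, rank 45
adept_total = adept_params(60, d, 19)              # shorter prompt, bottleneck 19

# Input lengths seen by the model: prompt length plus sequence length.
pt_len, dept_len, adept_len = 100 + max_seq_len, 40 + max_seq_len, 60 + max_seq_len

print(budget, dept_total, adept_total)
print(pt_len, dept_len, adept_len)
```

Under these assumed settings both decomposed variants stay within the PT parameter budget while feeding a shorter sequence to the model, which is the mechanism the caption credits for faster inference.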