DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning

Zhengxiang Shi, Aldo Lipani

TL;DR

This work tackles the efficiency bottleneck of Prompt Tuning (PT) by introducing Decomposed Prompt Tuning (DePT), which replaces a long soft prompt with a shorter prompt plus a pair of low-rank matrices that update the input embeddings, while keeping the trainable parameter count constant. By training the shorter prompt with a larger learning rate and the low-rank matrices with a smaller one, DePT delivers competitive or superior results to PT and many PEFT baselines across 23 NLP and vision-language tasks, with substantial time and memory savings that grow with model size. The method is compatible with parameter-efficient transfer learning and remains effective in few-shot settings, suggesting practical applicability to large-scale LLMs. Overall, DePT offers a scalable, orthogonal enhancement to PT that reduces computation without sacrificing performance, making it well-suited for deployment in high-throughput, large-model contexts.

Abstract

Prompt tuning (PT), where a small number of trainable soft (continuous) prompt vectors are affixed to the input of language models (LMs), has shown promising results across various tasks and models for parameter-efficient fine-tuning (PEFT). PT stands out from other PEFT approaches because it maintains competitive performance with fewer trainable parameters and does not drastically scale up its parameters as the model size expands. However, PT introduces additional soft prompt tokens, leading to longer input sequences, which significantly impacts training and inference time and memory usage due to the Transformer's quadratic complexity. This is particularly concerning for Large Language Models (LLMs) that face heavy daily querying. To address this issue, we propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt into a shorter soft prompt and a pair of low-rank matrices that are then optimised with two different learning rates. This allows DePT to achieve better performance while saving substantial memory and time costs compared to vanilla PT and its variants, without changing trainable parameter sizes. Through extensive experiments on 23 natural language processing (NLP) and vision-language (VL) tasks, we demonstrate that DePT outperforms state-of-the-art PEFT approaches, including the full fine-tuning baseline, in some scenarios. Additionally, we empirically show that DePT grows more efficient as the model size increases. Our further study reveals that DePT integrates seamlessly with parameter-efficient transfer learning in the few-shot learning setting and highlights its adaptability to various model architectures and sizes.
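The decomposition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The sizes `d`, `s`, `l`, `m`, and `r`, the matrix names `A` and `B`, the toy loss, and the two learning-rate values are all assumptions chosen for illustration, and the sketch assumes the product of the low-rank pair is added to the frozen input embeddings before the shorter prompt is prepended.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the text above).
d, s, l = 768, 256, 100   # embedding dim, max input length, vanilla PT prompt length
m, r = 40, 45             # shorter prompt length and rank of the low-rank pair

# Trainable parameters: a shorter soft prompt and a pair of low-rank matrices.
params = {
    "prompt": rng.normal(scale=0.02, size=(m, d)),
    "A": rng.normal(scale=0.02, size=(s, r)),
    "B": rng.normal(scale=0.02, size=(r, d)),
}

def dept_inputs(E, p):
    """Add the low-rank update to the frozen embeddings, then prepend the prompt."""
    E_updated = E + p["A"] @ p["B"]
    return np.concatenate([p["prompt"], E_updated], axis=0)

def toy_grads(E, p):
    """Gradients of the toy loss 0.5 * sum(X**2), with X = dept_inputs(E, p)."""
    G = E + p["A"] @ p["B"]   # gradient of the loss w.r.t. the updated embeddings
    return {"prompt": p["prompt"], "A": G @ p["B"].T, "B": p["A"].T @ G}

# Two learning rates: larger for the prompt, smaller for the low-rank pair.
# The specific values are placeholders.
lrs = {"prompt": 3e-1, "A": 1e-4, "B": 1e-4}

def sgd_step(p, grads, lrs):
    """One plain SGD step with a separate learning rate per parameter group."""
    return {k: p[k] - lrs[k] * grads[k] for k in p}

E = rng.normal(size=(s, d))   # stand-in for the frozen word embeddings of one input
X = dept_inputs(E, params)
n_trainable = sum(v.size for v in params.values())
new_params = sgd_step(params, toy_grads(E, params), lrs)

print(X.shape)                 # sequence length is m + s rather than l + s
print(n_trainable, l * d)      # same budget as a length-l soft prompt
```

With these sizes, the shorter prompt and the low-rank pair together use exactly as many trainable values as a length-100 soft prompt, while the model sees 60 fewer prompt positions per input. In an autograd framework the same idea would typically be expressed as two optimizer parameter groups with different learning rates.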

Paper Structure (40 sections, 3 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Overview of Fine-Tuning (FT), Prompt Tuning (PT), and Prompt Engineering. PT increases the length of the input sequence, leading to much greater computational demands during the training and inference phases.
  • Figure 2: Overview of the PETL framework (top) and our method, DePT (bottom). DePT decomposes the trainable soft prompt of vanilla PT into a shorter soft prompt and a pair of low-rank matrices, whose product is used to update the frozen word embeddings.
  • Figure 3: Performance on the GLUE benchmark for different soft prompt lengths $m$ in DePT, with the corresponding relative training time and memory cost. Time and memory are averaged over different model sizes with a batch size of 16. DePT consistently uses the same number of trainable parameters as vanilla PT ($m$=100).
  • Figure 4: Average inference speed on the GLUE benchmark for varying soft prompt lengths $m$ and low-rank matrix ranks $r$, with the total number of trainable parameters held constant. Small blue text indicates the speed relative to vanilla PT ($m$=100), which is shown in brown.
  • Figure 5: Test results on the GLUE benchmark using T5-Base, showing the importance of training DePT with different learning rates.
  • ...and 1 more figures
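
The captions for Figures 3 and 4 describe varying the prompt length $m$ and the rank $r$ while holding the trainable parameter count fixed. A short sketch of that bookkeeping follows. It assumes the budget constraint `m*d + (s + d)*r <= l*d`, with hypothetical defaults `l=100`, `d=768`, and `s=256`, and it uses the squared ratio of sequence lengths as a rough proxy for the quadratic self-attention cost only. The function names are illustrative.

```python
def rank_for_prompt_length(m, l=100, d=768, s=256):
    """Largest rank r such that m*d + (s + d)*r <= l*d (budget of a length-l prompt)."""
    if not 0 <= m <= l:
        raise ValueError("m must lie between 0 and l")
    return ((l - m) * d) // (s + d)

def relative_attention_cost(m, l=100, s=256):
    """Squared ratio of sequence lengths: a rough proxy for quadratic attention cost."""
    return ((s + m) / (s + l)) ** 2

for m in (0, 20, 40, 60, 80):
    r = rank_for_prompt_length(m)
    print(f"m={m:3d}  r={r:3d}  relative attention cost={relative_attention_cost(m):.3f}")
```

Under these assumed sizes, shorter prompts leave room for higher ranks, and the proxy cost falls as $m$ shrinks. Measured speed and memory also depend on the feed-forward layers, batch size, and hardware, so the proxy should not be read as a prediction of the results shown in the figures.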