Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability

Tsz Ting Chung, Leyang Cui, Lemao Liu, Xinting Huang, Shuming Shi, Dit-Yan Yeung

TL;DR

This paper investigates the ability of LLMs to develop a unified compression method that discretizes uninformative tokens, utilizing a self-supervised pre-training technique, and analyzes how Selection-p helps maintain performance on in-context learning with long contexts.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in a wide range of natural language processing tasks when leveraging in-context learning. To mitigate the additional computational and financial costs associated with in-context learning, several prompt compression methods have been proposed to compress the in-context learning prompts. Despite their success, these methods face challenges with transferability due to model-specific compression, or rely on external training data, such as data generated by GPT-4. In this paper, we investigate the ability of LLMs to develop a unified compression method that discretizes uninformative tokens, utilizing a self-supervised pre-training technique. By introducing a small number of parameters during continual pre-training, the proposed Selection-p produces a probability for each input token, indicating whether to preserve or discard it. Experiments show that Selection-p achieves state-of-the-art performance across numerous classification tasks, reaching compression rates of up to 10x with only a marginal 0.8% decrease in performance. Moreover, it exhibits superior transferability to different models compared to prior work. We further analyze how Selection-p helps maintain performance on in-context learning with long contexts.
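To make the preserve-or-discard idea concrete, the following is a minimal sketch of how per-token probabilities could be turned into a compressed prompt. It assumes that tokens with the highest probability are kept until a target compression rate is met; the function name `compress_by_probability`, the toy tokens, and the probability values are hypothetical and are not taken from the paper.

```python
from typing import List


def compress_by_probability(tokens: List[str], probs: List[float], rate: float) -> List[str]:
    """Keep roughly len(tokens) / rate tokens with the highest preservation
    probability, returned in their original order.

    Assumption: selection is a simple top-k over per-token probabilities.
    """
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    if rate < 1:
        raise ValueError("rate must be >= 1 (e.g. 10 for 10x compression)")
    # Number of tokens to keep; always keep at least one token.
    keep = max(1, int(len(tokens) / rate))
    # Rank positions by descending probability; ties go to the earlier token.
    ranked = sorted(range(len(tokens)), key=lambda i: (-probs[i], i))
    # Restore the original order so the compressed prompt stays readable.
    kept_positions = sorted(ranked[:keep])
    return [tokens[i] for i in kept_positions]


# Hypothetical demonstration and made-up probabilities, for illustration only.
tokens = ["Review", ":", "the", "movie", "was", "great", ".", "Sentiment", ":", "positive"]
probs = [0.9, 0.2, 0.1, 0.7, 0.3, 0.95, 0.05, 0.8, 0.15, 0.85]

compressed = compress_by_probability(tokens, probs, rate=2.0)
print(compressed)  # ['Review', 'movie', 'great', 'Sentiment', 'positive']
```

Because the output is a subset of the original tokens rather than a model-specific embedding, a prompt compressed this way can in principle be passed to a different model, which is the kind of transferability the abstract refers to.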

Paper Structure

This paper contains 36 sections, 4 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Illustration of the training process. Areas in orange are learnable parameters. For the input context $[x_1,x_2,\dots,x_{n-1}]$, inference without parameter updates is performed first to create the attention mask $\bar{p}$. The context and mask then form the model input for LoRA training and for updating the parameters of the additional linear layer.
  • Figure 2: Spearman's rank correlation coefficient (spearmanr) between the $p$ value (p), mean attention (a), and token-level perplexity (ppl) across different traditional classification tasks.
  • Figure 3: Analysis of the token preservation percentage with respect to different types of Part-of-Speech tags under a 10x compression rate.
  • Figure 4: Illustration of the compression result by Selection-$p$ for the Subj and WSC tasks under a 10x compression rate. Compression is performed with 19 demonstrations for Subj and 16 demonstrations for WSC, each totaling about 750 tokens.
  • Figure 5: Illustration of the compression result by Selection-$p$ for BANKING77 under a 10x compression rate. Compression is performed with 27 demonstrations totaling about 750 tokens.
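
Figure 2 reports Spearman's rank correlation between the $p$ value, mean attention, and token-level perplexity. As a rough illustration of what that statistic measures, the sketch below computes it in plain Python on toy data. The helper names and all numeric values are hypothetical; the paper's own figures were presumably produced with a library routine such as `spearmanr`.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Made-up per-token values, for illustration only.
p_values = [0.9, 0.2, 0.1, 0.7, 0.3]
mean_attention = [0.30, 0.10, 0.05, 0.25, 0.12]
perplexity = [8.0, 1.5, 1.2, 2.0, 6.0]

rho_p_attention = spearman(p_values, mean_attention)
rho_p_perplexity = spearman(p_values, perplexity)
print(rho_p_attention, rho_p_perplexity)
```

A coefficient near 1 means the two signals order the tokens almost identically, while a value near 0 means the orderings are largely unrelated.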