
ICPC: In-context Prompt Compression with Faster Inference

Ziyang Yu, Yuyu Liu

TL;DR

ICPC addresses the challenge of feeding long prompts to LLMs with an encoder-based prompt compression method that adaptively removes redundant content. It formalizes an information-loss objective and combines word-level (participle) units with a percentile-based threshold to preserve essential meaning while shortening the prompt, which enables faster inference without running an additional LLM. In experiments on Wikipedia, arXiv, and Reddit data with multiple encoders (e.g., BERT, RoBERTa, XLNet, ALBERT, T5, DeBERTa), ICPC improves task metrics (BLEU, ROUGE, BERTScore) and reduces compression time while preserving readability and scaling to very long texts. The approach is a practical, model-agnostic solution for efficient long-context processing in NLP applications and could be integrated into real-world AI systems that require fast, concise prompts.
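The TL;DR names an information-loss objective and a percentile threshold but does not give formulas. One way to write them down, assuming the information function is standard self-information and the encoder supplies a contextual probability for each unit, is sketched below. The notation (w_i, x', tau_q, L_info) is hypothetical and the paper's exact formulation may differ.

```latex
% Assumed notation: the prompt x = (w_1, \dots, w_n) is split into word-level units.
\begin{aligned}
I(w_i) &= -\log_2 P\bigl(w_i \mid x_{\setminus i}\bigr)
  && \text{(assumed: self-information from encoder probabilities)} \\
\tau_q &= \operatorname{Perc}_q\bigl(\{ I(w_1), \dots, I(w_n) \}\bigr)
  && \text{(q-th percentile of the scores)} \\
x' &= \bigl( w_i \in x : I(w_i) \ge \tau_q \bigr)
  && \text{(kept units, original order preserved)} \\
\mathcal{L}_{\text{info}} &= \sum_{i \,:\, I(w_i) < \tau_q} I(w_i)
  && \text{(information discarded by compression)}
\end{aligned}
```

Under this sketch, raising q drops more units and shortens x', while L_info can only grow, which is the trade-off the percentile threshold controls.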

Abstract

Despite the recent success of Large Language Models (LLMs), feeding them long prompts remains challenging because LLM inputs have a fixed maximum length. Prompt compression, which removes redundant tokens from the prompt, has emerged as a promising remedy. However, existing methods rely on an LLM to perform the compression, which requires additional computational resources and incurs memory overhead. To address this, we propose ICPC (In-context Prompt Compression), a novel and scalable prompt compression method that adaptively reduces prompt length. The key idea of ICPC is to use encoders to compute the probability of each word appearing in the prompt and then to compute the information carried by each word through an information function; this reduces information loss during prompt compression and speeds up compression. Empirically, we show that ICPC effectively compresses long texts of different categories and achieves better performance and speed on various NLP tasks.
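The abstract's key idea (per-word probabilities, converted to information scores, then used to drop low-information words) can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the per-word probabilities are taken as given (in ICPC they would come from an encoder such as BERT, which is not shown), the information function is assumed to be self-information in bits, and the percentile cut-off `q` and all function names are hypothetical.

```python
import math
from typing import List


def self_information(prob: float) -> float:
    """Assumed information function: I(w) = -log2 P(w), in bits."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def percentile(values: List[float], q: float) -> float:
    """q-th percentile with linear interpolation (NumPy's default convention)."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 100.0:
        raise ValueError("q must be in [0, 100]")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def compress(words: List[str], probs: List[float], q: float = 50.0) -> List[str]:
    """Hypothetical compressor: keep words whose information is at or above
    the q-th percentile, preserving their original order."""
    if len(words) != len(probs):
        raise ValueError("words and probs must have the same length")
    info = [self_information(p) for p in probs]
    threshold = percentile(info, q)
    return [w for w, i in zip(words, info) if i >= threshold]


# Toy input: probabilities are made up for illustration, not encoder output.
words = ["the", "transformer", "model", "is", "a", "neural", "architecture"]
probs = [0.9, 0.05, 0.2, 0.8, 0.85, 0.1, 0.15]
print(compress(words, probs, q=50.0))
# → ['transformer', 'model', 'neural', 'architecture']
```

Highly predictable function words ("the", "is", "a") carry little self-information and are removed first. Using `>=` keeps words that tie with the threshold, so `q=0.0` returns the prompt unchanged and larger `q` values compress more aggressively.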

Paper Structure (24 sections, 5 equations, 1 figure, 5 tables)

Figures (1)

  • Figure 1: Text before and after compression. Yellow highlights words of higher importance. Top: text before compression. Bottom: text after compression.