Prompt Compression for Large Language Models: A Survey

Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier

TL;DR

Prompt compression methods for LLMs fall into two categories: hard prompt methods (token removal or paraphrasing) and soft prompt methods (learned continuous embeddings). Both aim to reduce prompt length without sacrificing performance. The paper surveys architectures, mechanisms, and downstream adaptations, including attention-based interpretations and connections to PEFT, and discusses limitations such as information loss and modest speedups. It highlights future directions such as optimizing the compression encoder, combining hard and soft prompt methods, and leveraging insights from multimodal LLMs. The goal is to guide researchers and practitioners in designing more efficient prompting strategies for large-scale models.

Abstract

Leveraging large language models (LLMs) for complex natural language tasks typically requires long-form prompts to convey detailed requirements and information, which results in increased memory usage and inference costs. To mitigate these challenges, multiple efficient methods have been proposed, with prompt compression gaining significant research interest. This survey provides an overview of prompt compression techniques, categorized into hard prompt methods and soft prompt methods. First, the technical approaches of these methods are compared, followed by an exploration of various ways to understand their mechanisms, including the perspectives of attention optimization, Parameter-Efficient Fine-Tuning (PEFT), modality integration, and new synthetic language. We also examine the downstream adaptations of various prompt compression techniques. Finally, the limitations of current prompt compression methods are analyzed, and several future directions are outlined, such as optimizing the compression encoder, combining hard and soft prompt methods, and leveraging insights from multimodality.

Paper Structure

This paper contains 12 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Illustrative examples of prompt compression methods. Hard prompt methods remove low-information tokens or paraphrase for conciseness. Soft prompt methods compress text into a smaller number of special tokens, $<c_n>$. The grids below visualize attention patterns, where the y-axis represents the sequence of tokens, and the x-axis shows the tokens they attend to. In the original prompt, each token attends to all previous tokens. In hard prompts, each token cannot attend to previous deleted tokens ($D_i$). In soft prompts, after the compressed token ($C_i$) attends to all prior input tokens ($I_i$), subsequent output tokens ($O_i$) cannot attend to tokens before the compressed token.
  • Figure 2: Hierarchical overview of prompt compression methods and their downstream adaptations.
  • Figure 3: Architectures for various prompt compression models by hard prompt methods.
  • Figure 4: Architectures for various prompt compression models by soft prompt methods.
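
The attention patterns described in the Figure 1 caption can be sketched as boolean masks. The following is a minimal illustration under stated assumptions, not code from the paper: the function names (`causal_mask`, `hard_prompt_mask`, `soft_prompt_mask`, `show`) and the toy sequence lengths are hypothetical, and real methods apply such patterns inside a model rather than as standalone grids. Rows are the attending tokens and columns are the tokens attended to, matching the axes in the caption.

```python
def causal_mask(n):
    """Original prompt: token q attends to every token k at or before it."""
    return [[k <= q for k in range(n)] for q in range(n)]


def hard_prompt_mask(n, deleted):
    """Hard prompt: deleted positions (D_i) are dropped from the sequence,
    so no remaining token can attend to them (and they attend to nothing)."""
    deleted = set(deleted)
    return [
        [k <= q and k not in deleted and q not in deleted for k in range(n)]
        for q in range(n)
    ]


def soft_prompt_mask(n_input, n_compressed, n_output):
    """Soft prompt: compressed tokens (C_i) attend to all prior input tokens
    (I_i); output tokens (O_i) cannot attend to anything before the
    compressed tokens, i.e. they never see the raw input positions."""
    n = n_input + n_compressed + n_output
    out_start = n_input + n_compressed  # index of the first output token
    return [
        [k <= q and not (q >= out_start and k < n_input) for k in range(n)]
        for q in range(n)
    ]


def show(mask):
    """Render a mask as text: 'x' = may attend, '.' = blocked."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print("original:\n" + show(causal_mask(5)))
    print("hard (positions 1 and 3 deleted):\n" + show(hard_prompt_mask(5, [1, 3])))
    print("soft (3 input, 1 compressed, 2 output):\n" + show(soft_prompt_mask(3, 1, 2)))
```

In this sketch the hard-prompt case keeps the deleted positions in the grid only to make the comparison with the original prompt visible; in practice the shortened prompt simply has fewer tokens.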