Characterizing Prompt Compression Methods for Long Context Inference

Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami

TL;DR

The paper tackles long-context inference by systematically evaluating prompt compression methods, grouped into extractive, abstractive, and token pruning approaches and further split into query-aware and query-agnostic variants. Using LongBench across single-document QA, multi-document QA, and summarization, it finds that extractive compression generally performs best, enabling up to 10x compression with minimal accuracy degradation, while token pruning often underperforms and abstractive summarization offers only marginal gains in many scenarios. The study reports that reranker-based extractive methods outperform simple embedding retrieval, and that query-aware abstractive compression benefits from strong prompting or summarizers. For practitioners deploying long-context LLMs, the findings suggest prioritizing extractive compression with well-tuned rerankers, choosing chunk sizes carefully, and considering query-aware strategies for complex tasks such as Text-to-SQL and cross-document reasoning.

Abstract

Long context inference presents challenges at the system level with increased compute and memory requirements, as well as from an accuracy perspective in being able to reason over long contexts. Recently, several methods have been proposed to compress the prompt to reduce the context length. However, there has been little work on comparing the different proposed methods across different tasks through a standardized analysis. This has led to conflicting results. To address this, here we perform a comprehensive characterization and evaluation of different prompt compression methods. In particular, we analyze extractive compression, summarization-based abstractive compression, and token pruning methods. Surprisingly, we find that extractive compression often outperforms all the other approaches, and enables up to 10x compression with minimal accuracy degradation. Interestingly, we also find that despite several recent claims, token pruning methods often lag behind extractive compression. We only found marginal improvements on summarization tasks.
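
To make the extractive, query-aware setting concrete, here is a minimal sketch; it is not the paper's implementation. It splits the context into sentences, scores each one by word overlap with the query (a crude stand-in for the rerankers the paper evaluates), and keeps the top-scoring sentences verbatim until a word budget is reached. The names `split_sentences`, `overlap_score`, `extractive_compress`, and `budget_words` are hypothetical, and counting whitespace-separated words instead of model tokens is an assumption made to keep the example dependency-free.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter; a real system would chunk more carefully."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of distinct query words found in the chunk."""
    query_words = set(re.findall(r"\w+", query.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    return len(query_words & chunk_words)


def extractive_compress(context: str, query: str, budget_words: int) -> str:
    """Keep the highest-scoring sentences, unaltered, within a word budget."""
    chunks = split_sentences(context)
    # Rank chunk indices by relevance; sorted() is stable, so ties keep document order.
    ranked = sorted(range(len(chunks)), key=lambda i: -overlap_score(query, chunks[i]))
    kept, used = [], 0
    for i in ranked:
        n_words = len(chunks[i].split())
        if used + n_words <= budget_words:
            kept.append(i)
            used += n_words
    # Restore document order so the compressed context still reads naturally.
    return " ".join(chunks[i] for i in sorted(kept))


context = (
    "The museum opened in 1950. "
    "The novel Dune was written by Frank Herbert. "
    "Tickets cost ten dollars. "
    "The cafe closes at noon."
)
query = "Who wrote the novel Dune?"
compressed = extractive_compress(context, query, budget_words=8)
ratio = len(context.split()) / len(compressed.split())
print(compressed)
print(f"compression: {ratio:.2f}x")
```

Because the selected sentences are copied unchanged, the compressed prompt cannot introduce facts absent from the source, which is one plausible reason extractive methods degrade gracefully as the budget shrinks.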

Paper Structure (36 sections, 17 figures, 5 tables)

Figures (17)

  • Figure 1: LLM context length has been rapidly increasing as many applications can benefit from longer context lengths. However, this often comes with accuracy challenges as LLMs seem to struggle with reasoning over long context lengths, along with higher cost and time to first token.
  • Figure 2: An illustration of different prompt compression methods. Token pruning methods like LongLLMLingua [pan2024llmlingua2], Selective-Context [li2023compressing], and PCRL [jung2023discrete] perform compression by discarding irrelevant tokens. Abstractive compression methods like Prompt-SAW [ali2024prompt], RECOMP, and PRCA [yang2023prca] generate summaries by synthesizing information. Extractive compression methods like RECOMP [xu2023recomp] and reranker-based compression select documents, sentences, or phrases from the original context without altering them. In this example, each of the methods compresses the original context while keeping the necessary information to determine the book's author.
  • Figure 3: An illustration of query-aware and query-agnostic compression applied to a document in the prompt context. With query-aware compression, the compressed context of the document changes based on the user's specific query, presenting a tailored version each time. Conversely, query-agnostic compression maintains a consistent compressed context of the document, irrespective of the query presented.
  • Figure 4: Results of main methods with GPT-3.5-Turbo. For each dataset, the corresponding graphs plot the accuracy metric (either F1 or ROUGE-L) against the compression rate. We see similar results with Mixtral 8x7B (see \ref{fig:mixtral-pareto}) and DBRX Instruct (see \ref{fig:dbrx-pareto}).
  • Figure 5: Analysis of performing extractive compression using standard retrieval over embedding space compared to reranking. For retrieval, embeddings are produced using text-embedding-3-small. GPT-3.5-Turbo is used as the LLM. Results on all nine datasets are shown in \ref{fig:retriever-gpt.3.5-full}.
  • ...and 12 more figures
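
The query-aware versus query-agnostic distinction described in the Figure 3 caption can be illustrated with a small sketch. It is a toy under stated assumptions, not any method from the paper: the query-agnostic scorer ranks sentences by the document frequency of their words, the query-aware scorer ranks them by word overlap with the query, and the function names `query_agnostic`, `query_aware`, and `top_k_in_order` are hypothetical.

```python
import re
from collections import Counter


def words(text: str) -> list[str]:
    """Lowercased word tokens (a toy tokenizer)."""
    return re.findall(r"\w+", text.lower())


def top_k_in_order(sentences: list[str], scores: list[int], keep: int) -> list[str]:
    """Return the `keep` best-scoring sentences, restored to document order."""
    best = sorted(range(len(sentences)), key=lambda i: -scores[i])[:keep]
    return [sentences[i] for i in sorted(best)]


def query_agnostic(sentences: list[str], keep: int) -> list[str]:
    """Score with the document alone: sum of each distinct word's document frequency."""
    freq = Counter(w for s in sentences for w in words(s))
    scores = [sum(freq[w] for w in set(words(s))) for s in sentences]
    return top_k_in_order(sentences, scores, keep)


def query_aware(sentences: list[str], query: str, keep: int) -> list[str]:
    """Score against the query: number of distinct words shared with it."""
    q = set(words(query))
    scores = [len(q & set(words(s))) for s in sentences]
    return top_k_in_order(sentences, scores, keep)


document = [
    "Dune is a science fiction novel.",
    "Frank Herbert wrote Dune.",
    "Dune was published in 1965.",
    "The novel won the Hugo Award.",
]
q1 = "Who wrote it?"
q2 = "When was it published?"

# The query-agnostic result is computed once and reused for every query.
agnostic = query_agnostic(document, keep=1)
print("agnostic:", agnostic)
# The query-aware result must be recomputed per query and changes with it.
print("aware q1:", query_aware(document, q1, keep=1))
print("aware q2:", query_aware(document, q2, keep=1))
```

The trade-off this exposes is the usual one: a query-agnostic compression can be precomputed and cached per document, while a query-aware one costs extra work per request but can retain exactly the sentence a given question needs.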