Characterizing Prompt Compression Methods for Long Context Inference
Siddharth Jha, Lutfi Eren Erdogan, Sehoon Kim, Kurt Keutzer, Amir Gholami
TL;DR
The paper tackles the challenge of long-context inference by systematically evaluating prompt compression methods, categorized into extractive, abstractive, and token pruning, with a further split into query-aware versus query-agnostic. Using LongBench across single-document QA, multi-document QA, and summarization, it finds that extractive compression generally delivers the best performance and can achieve substantial compression with minimal accuracy loss, while token pruning often underperforms and abstractive summarization offers only marginal gains in many scenarios. The study highlights the superiority of reranker-based extractive methods over simple retrieval and emphasizes the benefits of query-aware abstractive compression when strong prompting or summarizers are used. The findings provide practical guidance for deploying long-context LLMs, suggesting that practitioners should prioritize extractive compression with well-tuned rerankers, carefully choose chunk sizes, and consider query-aware strategies for complex tasks such as Text-to-SQL and cross-document reasoning.
Abstract
Long context inference presents challenges at the system level with increased compute and memory requirements, as well as from an accuracy perspective in being able to reason over long contexts. Recently, several methods have been proposed to compress the prompt to reduce the context length. However, there has been little work on comparing the different proposed methods across different tasks through a standardized analysis. This has led to conflicting results. To address this, here we perform a comprehensive characterization and evaluation of different prompt compression methods. In particular, we analyze extractive compression, summarization-based abstractive compression, and token pruning methods. Surprisingly, we find that extractive compression often outperforms all the other approaches, and enables up to 10x compression with minimal accuracy degradation. Interestingly, we also find that despite several recent claims, token pruning methods often lag behind extractive compression. We found only marginal improvements on summarization tasks.
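
To make the idea of query-aware extractive compression concrete, the following is a minimal sketch and not the paper's implementation. It splits the context into fixed-size chunks, scores each chunk against the query, and keeps the highest-scoring chunks that fit a budget. The function names, the `chunk_size` and `token_budget` parameters, and the word-overlap scorer are all assumptions made for illustration; the overlap scorer is a toy stand-in for a learned reranker, and whitespace-separated words stand in for model tokens.

```python
import re
from typing import List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def overlap_score(query: str, chunk: str) -> float:
    """Toy relevance score (assumption): fraction of distinct query words found in the chunk.

    A real system would call a reranker or retriever model here instead.
    """
    query_words = set(re.findall(r"\w+", query.lower()))
    chunk_words = set(re.findall(r"\w+", chunk.lower()))
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def extractive_compress(context: str, query: str, chunk_size: int = 6, token_budget: int = 6) -> str:
    """Keep the highest-scoring chunks that fit in token_budget words, in document order."""
    chunks = split_into_chunks(context, chunk_size)
    # Rank chunk indices by relevance to the query, most relevant first.
    ranked = sorted(range(len(chunks)), key=lambda i: overlap_score(query, chunks[i]), reverse=True)
    kept, used = [], 0
    for i in ranked:
        size = len(chunks[i].split())
        if used + size > token_budget:
            continue  # this chunk does not fit in the remaining budget
        kept.append(i)
        used += size
    # Restore the original order so the compressed prompt reads coherently.
    return " ".join(chunks[i] for i in sorted(kept))


filler = "alpha beta gamma delta epsilon zeta " * 5
context = filler + "the capital of France is Paris " + filler
compressed = extractive_compress(context, "What is the capital of France?")
ratio = len(context.split()) / len(compressed.split())
print(compressed)
print(f"compression ratio: {ratio:.0f}x")
```

In this toy input, the 66-word context is reduced to the single 6-word chunk that answers the query, giving an 11x reduction in word count. Dropping the query from the scoring step (for example, scoring chunks by some query-independent salience measure) would turn this into a query-agnostic variant.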
