Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

Cornelius Kummer, Lena Jurkschat, Michael Färber, Sahar Vahdati

Abstract

With the wide adoption of language models for IR -- and specifically RAG systems -- the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and, therefore, increased compute. Prompt compression, which reduces the size of input prompts while aiming to preserve performance on downstream tasks, has established itself as a cost-effective and low-latency method for accelerating inference in large language models. However, its usefulness depends on whether the additional preprocessing time is offset by faster decoding. We present the first systematic, large-scale study of this trade-off, with thousands of runs and 30,000 queries across several open-source LLMs and three GPU classes. Our evaluation separates compression overhead from decoding latency while tracking output quality and memory usage. LLMLingua achieves up to 18% end-to-end speed-ups when prompt length, compression ratio, and hardware capacity are well matched, with response quality remaining statistically unchanged across summarization, code generation, and question answering tasks. Outside this operating window, however, the compression step dominates and cancels out the gains. We also show that effective compression can reduce memory usage enough to offload workloads from data center GPUs to commodity cards, with only a 0.3s increase in latency. Our open-source profiler predicts the latency break-even point for each model-hardware setup, providing practical guidance on when prompt compression delivers real-world benefits.
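
To make the break-even trade-off concrete, the following minimal Python sketch models end-to-end latency as compression overhead plus prefill plus decoding. It is an illustration only, not the paper's profiler: the function name, per-token timings, and the 12K-token / 50%-keep example are assumed placeholders, and per-token decoding speed-ups from a smaller KV cache are ignored for simplicity.

    # Minimal latency model (illustrative assumptions, not measured values).
    def end_to_end_latency(prompt_tokens: int, output_tokens: int,
                           prefill_s_per_token: float, decode_s_per_token: float,
                           compression_s: float = 0.0, keep_ratio: float = 1.0) -> float:
        """Optional compression step, then prefill over the (possibly shortened)
        prompt, then autoregressive decoding of the output."""
        kept = int(prompt_tokens * keep_ratio)
        return (compression_s
                + kept * prefill_s_per_token
                + output_tokens * decode_s_per_token)

    # Assumed example: a 12K-token prompt, 200 output tokens, and a compressor
    # that keeps 50% of the tokens and itself takes 1.2s.
    baseline = end_to_end_latency(12_000, 200,
                                  prefill_s_per_token=4e-4, decode_s_per_token=2e-2)
    compressed = end_to_end_latency(12_000, 200,
                                    prefill_s_per_token=4e-4, decode_s_per_token=2e-2,
                                    compression_s=1.2, keep_ratio=0.5)
    print(f"baseline {baseline:.2f}s vs. compressed {compressed:.2f}s")

With these assumed numbers, compression saves roughly 1.2s of an 8.8s baseline; if the compression step took about 3s instead, the saving would already be cancelled out, which is the break-even behavior the profiler is meant to predict.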

Paper Structure

This paper contains 23 sections, 5 figures, and 1 table.

Figures (5)

  • Figure 1: Compression latency as a function of the compression ratio (a) and of the executing hardware (b). With LLMLingua-2, latency is independent of the compression ratio and stays at most $\sim$3s even for the longest possible prompts ($48K$ tokens). Panel (b): compression latency and the share of model inference time for LLMLingua-2 (left) and the LLMLingua-2-small variant (right) across compression hardware, for a prompt size of 4,000 tokens.
  • Figure 2: Total prompt compression latency of the small LLMLingua variants under increasing prompt size, using a compression rate of $0.5$ and an Nvidia A100 GPU. Model inference latency as a percentage of the overall compression latency is shown in yellow.
  • Figure 3: Speed-up in single-token generation (time to first token) for all tested target models under prompt compression with LLMLingua-2 on an Nvidia A100 GPU. A compression ratio of 1 marks the baseline, i.e., no compression applied to the prompt.
  • Figure 4: Response quality of LLMLingua-2-compressed LongBench prompts across different LLMs, compared to the uncompressed baseline and broken down by task type.
  • Figure 5: Target compression-rate adherence of LLMLingua as a function of prompt length, compared to a perfect compression that exactly matches the requested rate. The compression model does not achieve the requested rate, which leads to unpredictable API costs, latency, and quality. A minimal sketch for quantifying this adherence follows this list.
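
As a complement to Figure 5, the short Python sketch below shows one way rate adherence can be quantified: compare the achieved keep rate of a compressed prompt against the requested rate. The whitespace tokenizer and the example numbers are illustrative assumptions; an actual measurement would use the target model's tokenizer and the compressor's real output.

    def keep_rate(original: str, compressed: str) -> float:
        # Fraction of tokens that survive compression (1.0 means no compression).
        # Whitespace splitting is a stand-in for the model's tokenizer.
        return len(compressed.split()) / max(len(original.split()), 1)

    def adherence_error(original: str, compressed: str, target_rate: float) -> float:
        # Absolute deviation of the achieved keep rate from the requested one.
        return abs(keep_rate(original, compressed) - target_rate)

    # Hypothetical example: a compressor asked to keep 50% of the tokens that
    # actually keeps ~70% has an adherence error of ~0.2, which translates into
    # higher-than-budgeted API cost and latency.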