Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models

Alliot Nagle, Adway Girish, Marco Bondaschi, Michael Gastpar, Ashok Vardhan Makkuva, Hyeji Kim

TL;DR

This work formulates prompt compression for black-box LLMs as a rate-distortion problem, introducing the distortion-rate function $D^*(R)$ as the fundamental limit and deriving a dual linear program to compute it. It distinguishes query-agnostic from query-aware compression, gives an efficient algorithm for solving the dual problem, and shows that query-awareness substantially narrows the gap to the theoretical limit. On synthetic data, the proposed Adaptive QuerySelect achieves the best performance among the methods compared: it is the only one to outperform the optimal query-agnostic strategy, and for some queries it matches the optimal query-aware strategy, which highlights the value of variable-rate compression. Experiments on a small natural-language dataset, together with beam-search-based approximations for larger data, support the framework and show meaningful gains from query-aware compression, with implications for shortening prompts while preserving downstream performance.

Abstract

We formalize the problem of prompt compression for large language models (LLMs) and present a framework to unify token-level prompt compression methods which create hard prompts for black-box models. We derive the distortion-rate function for this setup as a linear program, and provide an efficient algorithm to compute this fundamental limit via the dual of the linear program. Using the distortion-rate function as the baseline, we study the performance of existing compression schemes on a synthetic dataset consisting of prompts generated from a Markov chain, natural language queries, and their respective answers. Our empirical analysis demonstrates the criticality of query-aware prompt compression, where the compressor has knowledge of the downstream task/query for the black-box LLM. We show that there is a large gap between the performance of current prompt compression methods and the optimal strategy, and propose Adaptive QuerySelect, a query-aware, variable-rate adaptation of a prior work to close the gap. We extend our experiments to a small natural language dataset to further confirm our findings on our synthetic dataset.

Paper Structure (46 sections, 4 theorems, 28 equations, 17 figures, 6 tables, 2 algorithms)

This paper contains 46 sections, 4 theorems, 28 equations, 17 figures, 6 tables, 2 algorithms.

Key Result

Proposition 1

The distortion-rate function for query-agnostic prompt compression (eqn: RD_prompt_explicit) is given by the solution to the linear program where, for each $x \in \mathcal{X}$, $\mathcal{M}_x$ denotes the set of compressed prompts associated with $x$, i.e., the set of all possible token sequences of length smaller than $\mathrm{len}(x)$, the vectors $\boldsymbol{z}_x\in \mathbb{R}_+^{\mathcal{M}_x}$ a
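To make the candidate set $\mathcal{M}_x$ concrete, the following minimal sketch enumerates it for a toy binary prompt and compares it with the smaller set of pruned prompts (subsequences of $x$) that Figure 5 reports to be a good approximation. The function names, the example prompt, and the choice to include the empty sequence are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations, product


def all_shorter_sequences(x, vocab):
    """All token sequences over `vocab` with length strictly less than len(x).

    Assumption: the empty sequence (length 0) counts as a valid compressed prompt.
    """
    return {seq for k in range(len(x)) for seq in product(vocab, repeat=k)}


def pruned_sequences(x):
    """All sequences obtained from x by deleting at least one token (order kept)."""
    n = len(x)
    return {
        tuple(x[i] for i in kept)
        for k in range(n)
        for kept in combinations(range(n), k)
    }


# Hypothetical binary prompt of length 4.
x = (0, 1, 1, 0)
all_m = all_shorter_sequences(x, vocab=(0, 1))
pruned = pruned_sequences(x)

print(len(all_m), len(pruned))
```

The full set grows as $\sum_{k < \mathrm{len}(x)} |\mathcal{V}|^k$ for a vocabulary $\mathcal{V}$, while the pruned set is bounded by the number of index subsets of $x$ and is always contained in the full set, which is why restricting to pruned prompts makes the computation cheaper.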

Figures (17)

  • Figure 1: The distortion-rate trade-off of all prompt compression methods compared to the query-aware and query-agnostic theoretical limits on a synthetic dataset with binary prompts. All distortions are computed with the log loss (left) and 0/1 loss (right) distortion metrics formally defined in (eqn: distortions). We observe that (1) most existing methods are far from the theoretical limit, suggesting that there is still room for improvement in this field, (2) conditioning on the query allows for a significant improvement, as seen by the performance of the query-aware method QuerySelect against the query-agnostic LLMLingua-2 [wu2024llmlingua2], and (3) our proposed method Adaptive QuerySelect, a query-aware and variable-rate adaptation of LLMLingua-2, achieves the best performance among all methods considered, and is the only method to outperform the optimal query-agnostic strategy.
  • Figure 2: Model for prompt compression in LLMs. (a): Without prompt compression, the LLM takes a long Prompt and Query as input, and produces an Output distribution. (b) and (c): The prompt is passed through a compressor to obtain a shorter Compressed prompt and the LLM takes this compressed prompt and query as input instead. (b) The compressor does not have access to the query, and preserves all highlighted tokens. (c) The compressor has access to the query, and preserves only the tokens highlighted in orange.
  • Figure 3: Geometric intuition for solving (eqn: RD_prompt_dual_LP): lower-left convex envelope for an example $\{(\boldsymbol{R}_{x, m}, \boldsymbol{D}_{x,m})\}_{m \in \mathcal{M}_x}$ for a fixed $x$ with $|\mathcal{M}_x| = 11$, $k_x = 3$.
  • Figure 4: We highlight the distortion-rate curves for two of the seven queries in the validation partition of our synthetic dataset. Our method, Adaptive QuerySelect, is able to match the performance of the optimal query-aware strategy (left). Some queries naturally incur less distortion than others with the target LLM, even with a query-agnostic approach, if the query is aligned well with the data generation process for the prompt (right). Note that QuerySelect covers the line of LLMLingua-2 as their performance is identical for this query.
  • Figure 5: Query-agnostic distortion-rate curves plotted for log loss and 0/1 loss distortion measures. The curves marked with a 'diamond' are computed using all possible shorter sequences, while those marked with an '$\times$' are computed using only pruned versions of the original prompt. They are nearly identical, which suggests that a good approximation to the optimal distortion-rate curve can be obtained by considering pruned prompts only.
  • ...and 12 more figures
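
The lower-left convex envelope mentioned in the Figure 3 caption can be sketched as follows. This is a minimal illustration, assuming the envelope is the decreasing part of the lower convex hull of the (rate, distortion) points for one fixed prompt; the function name and the example points are hypothetical and are not the data behind the figure.

```python
def lower_left_envelope(points):
    """Vertices of the lower-left convex envelope of (rate, distortion) points.

    Uses a monotone-chain scan over points sorted by rate, then keeps only the
    part of the lower hull along which distortion strictly decreases.
    """
    hull = []
    for r, d in sorted(set(points)):
        # Same rate as the previous vertex: sorting put the lower distortion first.
        if hull and hull[-1][0] == r:
            continue
        # Drop the last vertex while it lies on or above the chord to the new point.
        while len(hull) >= 2:
            (r1, d1), (r2, d2) = hull[-2], hull[-1]
            cross = (r2 - r1) * (d - d1) - (d2 - d1) * (r - r1)
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append((r, d))
    # Slopes along a lower hull only increase, so stop once distortion stops falling.
    envelope = [hull[0]]
    for vertex in hull[1:]:
        if vertex[1] < envelope[-1][1]:
            envelope.append(vertex)
        else:
            break
    return envelope


# Hypothetical (rate, distortion) pairs for the compressed prompts of one prompt x.
example_points = [
    (0.0, 1.0), (0.25, 0.9), (0.5, 0.4),
    (0.75, 0.35), (1.0, 0.0), (1.0, 0.2),
]
envelope = lower_left_envelope(example_points)
print(envelope)
```

Only the envelope vertices can be optimal for some trade-off between rate and distortion; in the caption's notation, the example there has 11 candidate points of which $k_x = 3$ lie on the envelope.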

Theorems & Definitions (9)

  • Proposition 1: Primal LP
  • proof
  • Theorem 1: Dual LP
  • Proposition 2: Query-aware primal LPs
  • proof
  • Remark 1
  • Theorem 2: Query-aware dual LPs
  • proof
  • proof: Proof of Theorem 1 (Dual LP)
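
The exact statements of the primal and dual LPs are not reproduced in this summary. As a rough illustration of how a dual formulation can be evaluated, the sketch below assumes the primal minimizes the expected distortion $\sum_x p(x) \sum_m z_{x,m} D_{x,m}$ subject to an expected-rate budget $R$, with each $\boldsymbol{z}_x$ a probability vector, and it maximizes the standard Lagrangian dual $-\lambda R + \sum_x p(x) \min_m (D_{x,m} + \lambda R_{x,m})$ over $\lambda \ge 0$. The function name, input layout, and toy numbers are assumptions made for illustration and should not be read as the paper's algorithm.

```python
def distortion_rate(prompts, rate_budget):
    """Evaluate the assumed Lagrangian dual of a toy distortion-rate LP.

    prompts: list of (probability, [(rate, distortion), ...]) pairs, one per prompt.
    rate_budget: maximum allowed expected rate R.
    """
    min_rate = sum(p * min(r for r, _ in cands) for p, cands in prompts)
    if rate_budget < min_rate:
        raise ValueError("rate budget is below the smallest achievable rate")

    def dual(lam):
        # Dual objective: -lam * R + sum_x p(x) * min_m (D + lam * R).
        return -lam * rate_budget + sum(
            p * min(d + lam * r for r, d in cands) for p, cands in prompts
        )

    # The dual is concave and piecewise linear in lam, so its maximum is at
    # lam = 0 or at a multiplier where two candidates of one prompt tie.
    lambdas = {0.0}
    for _, cands in prompts:
        for i, (r1, d1) in enumerate(cands):
            for r2, d2 in cands[i + 1:]:
                if r1 != r2:
                    lam = (d1 - d2) / (r2 - r1)
                    if lam > 0:
                        lambdas.add(lam)
    return max(dual(lam) for lam in lambdas)


# Hypothetical instance: one prompt with three candidate compressed prompts.
single = [(1.0, [(0.0, 1.0), (0.5, 0.4), (1.0, 0.0)])]
# Hypothetical instance: two equally likely prompts with two candidates each.
pair = [(0.5, [(0.0, 1.0), (1.0, 0.0)]), (0.5, [(0.0, 1.0), (1.0, 0.5)])]

print(distortion_rate(single, 0.25), distortion_rate(pair, 0.5))
```

In the single-prompt instance the result at $R = 0.25$ lies on the segment joining the envelope vertices $(0, 1)$ and $(0.5, 0.4)$, which is the geometric picture described in the Figure 3 caption. In the two-prompt instance the budget is spent on the prompt whose distortion falls fastest per unit of rate.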