PIS: Linking Importance Sampling and Attention Mechanisms for Efficient Prompt Compression

Lizhe Chen, Binjia Zhou, Yuyao Ge, Jiayi Chen, Shiguang NI

TL;DR

Prompt Importance Sampling (PIS) tackles the cost of large language models by introducing a measure-theoretically grounded, dual-level prompt compression framework that uses attention-based token saliency for token pruning and Russian roulette for sentence pruning. A lightweight 9-layer reinforcement learning policy adapts per-sentence compression, while an encoder-based attention analysis informs token importance with a TF-IDF correction to preserve key terms. Empirical results across multiple domains show improved compression quality and reduced latency, with robust performance in out-of-domain tasks and notable gains in downstream accuracy when using compressed prompts. This work advances resource-efficient LLM deployment by aligning compression decisions with intrinsic model mechanisms rather than relying on external generation models, and it provides a foundation for further exploration of model-aware prompt management.

Abstract

Large language models (LLMs) have achieved remarkable progress, demonstrating unprecedented capabilities across various natural language processing tasks. However, the high costs associated with such exceptional performance limit the widespread adoption of LLMs, highlighting the need for prompt compression. Existing prompt compression methods primarily rely on heuristic truncation or abstractive summarization techniques, which fundamentally overlook the intrinsic mechanisms of LLMs and lack a systematic evaluation of token importance for generation. In this work, we introduce Prompt Importance Sampling (PIS), a novel compression framework that dynamically compresses prompts by sampling important tokens based on the analysis of attention scores of hidden states. PIS employs a dual-level compression mechanism: 1) at the token level, we quantify saliency using LLM-native attention scores and implement adaptive compression through a lightweight 9-layer reinforcement learning (RL) network; 2) at the semantic level, we propose a Russian roulette sampling strategy for sentence-level importance sampling. Comprehensive evaluations across multiple domain benchmarks demonstrate that our method achieves state-of-the-art compression performance. Notably, our framework serendipitously enhances reasoning efficiency through optimized context structuring. This work advances prompt engineering by offering both theoretical grounding and practical efficiency in context management for LLMs.
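The dual-level mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `keep_ratio` parameter, the `floor` parameter, and the survival probability `score / max(score)` are assumptions made for illustration, since this section gives neither the exact saliency formula nor the exact roulette rule. Token pruning keeps the highest-saliency tokens in their original order, and the roulette step keeps each sentence with a probability tied to its importance.

```python
import math
import random


def prune_tokens(tokens, saliency, keep_ratio):
    """Keep the highest-saliency tokens, preserving their original order.

    `saliency` stands in for attention-derived scores (hypothetical input).
    """
    if len(tokens) != len(saliency):
        raise ValueError("tokens and saliency must have the same length")
    if not tokens:
        return []
    n_keep = max(1, math.ceil(keep_ratio * len(tokens)))
    # Rank indices by descending saliency; ties broken by position.
    ranked = sorted(range(len(tokens)), key=lambda i: (-saliency[i], i))
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept]


def russian_roulette(sentences, importance, rng, floor=0.0):
    """Keep each sentence with survival probability p_i.

    Assumed form: p_i = max(floor, importance_i / max importance).
    """
    top = max(importance, default=0.0)
    survivors = []
    for sentence, score in zip(sentences, importance):
        p = max(floor, score / top) if top > 0 else floor
        # rng.random() lies in [0, 1), so p = 1 always survives
        # and p = 0 never does.
        if rng.random() < p:
            survivors.append(sentence)
    return survivors
```

Passing a seeded `random.Random` instance as `rng` makes the sentence-level sampling reproducible, and a non-zero `floor` gives low-importance sentences a small chance of surviving.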

Paper Structure

This paper contains 31 sections, 12 equations, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Compared to methods that train a specific model on the LLM input and its compressed version, our approach achieves ultra-low-loss, high-performance compression through importance sampling at two levels, which needs only a compact 9-layer RL policy network.
  • Figure 2: Pre-experiment setup: Evaluating the impact of randomly adding words to a prompt. The original question “Which is larger, 9.9 or 9.11?” is modified by inserting varying numbers of random words at random positions within the prompt. The results show that as more random words are added, the correctness of the LLM's responses decreases, with nearly complete failure in answering correctly after adding 10 random words.
  • Figure 3: Our method first applies token-level importance sampling, followed by sentence-level sampling using the Russian roulette technique, minimizing redundancy to generate the most concise prompt.
  • Figure 4: DDQN-based adaptive compression ratio selection. The model takes embeddings from an encoder-only model as input states and outputs Q-values for candidate compression ratios. DDQN balances effectiveness and efficiency.
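
The ratio selection described for Figure 4 follows the standard Double DQN pattern, sketched below under stated assumptions. The candidate ratios, the reward value, and the function names are hypothetical; only the general rule is standard: the online network chooses the next action and the target network evaluates it.

```python
def select_ratio(q_values, ratios):
    """Greedy action: the candidate ratio with the largest Q-value."""
    best = max(range(len(ratios)), key=lambda a: q_values[a])
    return ratios[best]


def ddqn_target(reward, gamma, q_online_next, q_target_next, done):
    """Double DQN bootstrap target for one transition.

    The online network picks the next action; the target network scores it.
    """
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return reward + gamma * q_target_next[best_action]


# Hypothetical candidate compression ratios and Q-values.
ratios = [0.2, 0.4, 0.6]
chosen = select_ratio([0.1, 0.7, 0.3], ratios)
```

Separating action selection from action evaluation is what distinguishes Double DQN from plain DQN, and it reduces the overestimation of Q-values.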