ProCut: LLM Prompt Compression via Attribution Estimation

Zhentao Xu, Fengyi Li, Albert Chen, Xiaofeng Wang

TL;DR

ProCut reframes prompt compression as a segment-level attribution problem, removing low-utility sections from large prompt templates without sacrificing task performance. It supports multiple attribution methods (SHAP, Leave-One-Out, LASSO) and introduces an LLM-driven estimator that reduces compression latency by over 50% while preserving fidelity, enabling scalable deployment. The framework integrates with prompt-optimization tools like TextGrad to curb prompt growth while maintaining effectiveness. Across 12 tasks from five datasets and two production prompts, ProCut achieves significant token reductions (up to 84% in production) with comparable or improved performance, which translates into meaningful LLM-inference cost savings in practice.

Abstract

In large-scale industrial LLM systems, prompt templates often expand to thousands of tokens as teams iteratively incorporate sections such as task instructions, few-shot examples, and heuristic rules to enhance robustness and coverage. This expansion leads to bloated prompts that are difficult to maintain and incur significant inference latency and serving costs. To address this, we introduce Prompt Compression via Attribution Estimation (ProCut), a flexible, LLM-agnostic, training-free framework that compresses prompts through attribution analysis. ProCut segments prompt templates into semantically meaningful units, quantifies their impact on task performance, and prunes low-utility components. Through extensive experiments on five public benchmark datasets and real-world industrial prompts, we show that ProCut achieves substantial prompt size reductions (78% fewer tokens in production) while maintaining or even slightly improving task performance (up to 62% better than alternative methods). We further introduce an LLM-driven attribution estimator that reduces compression latency by over 50%, and demonstrate that ProCut integrates seamlessly with existing prompt-optimization frameworks to produce concise, high-performing prompts.
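
The segment, attribute, and prune loop described above can be illustrated with a minimal Leave-One-Out sketch. This is not the authors' implementation: the function names, the `keep_k` parameter, and the `toy_score` evaluator are hypothetical, and `toy_score` merely stands in for running an LLM on a task and measuring its performance.

```python
from typing import Callable, List

def leave_one_out_attribution(
    segments: List[str],
    score_fn: Callable[[List[str]], float],
) -> List[float]:
    """Attribute to each segment the score lost when that segment is removed."""
    full_score = score_fn(segments)
    attributions = []
    for i in range(len(segments)):
        ablated = segments[:i] + segments[i + 1:]
        attributions.append(full_score - score_fn(ablated))
    return attributions

def prune_segments(
    segments: List[str],
    attributions: List[float],
    keep_k: int,
) -> List[str]:
    """Keep the keep_k highest-attribution segments in their original order."""
    ranked = sorted(range(len(segments)), key=lambda i: attributions[i], reverse=True)
    kept = sorted(ranked[:keep_k])
    return [segments[i] for i in kept]

def toy_score(segments: List[str]) -> float:
    # Hypothetical evaluator: a real one would call an LLM with the
    # assembled prompt and compute a task metric on held-out examples.
    useful = {"instruction": 0.6, "example": 0.3}
    return sum(useful.get(s, 0.0) for s in segments)

segments = ["instruction", "legacy rule", "example", "boilerplate"]
attr = leave_one_out_attribution(segments, toy_score)
compressed = prune_segments(segments, attr, keep_k=2)
print(compressed)
```

Leave-One-Out needs one evaluation per segment plus one for the full prompt, which is why attribution cost grows with the number of segments. SHAP and LASSO, the other methods named in the TL;DR, would replace only `leave_one_out_attribution` in this sketch.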

Paper Structure

This paper contains 43 sections, 14 figures, 14 tables, 1 algorithm.

Figures (14)

  • Figure 1: ProCut framework overview. The process consists of three stages: segmenting the prompt template, estimating the importance of each segment via attribution analysis, and pruning low-impact segments.
  • Figure 2: Performance of compressed prompts. Grey bars show baselines, blue bars show ProCut variants, the dashed orange line indicates the brute-force oracle.
  • Figure 3: Prompt Performance vs Token Reductions.
  • Figure 4: Trade-off between attribution quality (NDCG) and computational cost (latency in seconds). The left plot shows the average quality vs. cost across all datasets, while the right plot presents results for each individual task.
  • Figure 5: Robustness of ProCut on the SQuAD dataset with noisy metrics: compression performance (F1 of compressed prompts) and attribution performance (NDCG vs. noise-free reference) remain stable, with only modest degradation under large-scale noise.
  • ...and 9 more figures
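
Figures 4 and 5 report attribution quality as NDCG. As a rough illustration of how an estimated segment ranking could be scored against reference attributions, the sketch below uses one common NDCG formulation (linear gain, log2 discount). The exact variant used in the paper is not stated in this summary, so the formulation, the function names, and the sample values are assumptions, and reference attributions are assumed to be non-negative.

```python
import math
from typing import Sequence

def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(estimated: Sequence[float], reference: Sequence[float]) -> float:
    """Score the ranking induced by `estimated` against `reference` relevances."""
    order = sorted(range(len(estimated)), key=lambda i: estimated[i], reverse=True)
    ranked_rel = [reference[i] for i in order]
    ideal = dcg(sorted(reference, reverse=True))
    return dcg(ranked_rel) / ideal if ideal > 0 else 0.0

# Hypothetical reference attributions for four prompt segments.
reference = [3.0, 0.0, 2.0, 1.0]
perfect = ndcg([0.9, 0.1, 0.5, 0.3], reference)  # same ordering as reference
swapped = ndcg([0.1, 0.9, 0.5, 0.3], reference)  # least useful segment ranked first
print(perfect, swapped)
```

An estimator that orders segments exactly as the reference does scores 1.0; misranking the most important segments lowers the score more than misranking the least important ones.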