Dynamic Compressing Prompts for Efficient Inference of Large Language Models

Jinwu Hu, Wei Zhang, Yufeng Wang, Yu Hu, Bin Xiao, Mingkui Tan, Qing Du

TL;DR

This paper addresses the cost of long prompts for large language models by introducing Dynamic Compressing Prompts (LLM-DCP). It frames prompt compression as a Markov Decision Process and trains the compression agent with a Hierarchical Prompt Compression strategy, so that the agent learns compact yet information-preserving prompts without relying on rewards from a black-box LLM. The reward function balances compression rate, output fidelity, and retention of key content, while distribution alignment via instruction tuning enables training without external LLM feedback. The approach outperforms state-of-the-art baselines, particularly at higher compression rates, generalizes across tasks, and reports practical training improvements. The result is a scalable, task-agnostic framework for efficient LLM inference aimed at cost-sensitive deployments.

Abstract

Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can hinder performance because of the limited context windows of LLMs. While prompt compression is a straightforward solution, existing methods confront the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. To tackle these issues, we propose a task-agnostic method called Dynamic Compressing Prompts (LLM-DCP). Our method reduces the number of prompt tokens while aiming to preserve the performance as much as possible. We model prompt compression as a Markov Decision Process (MDP), enabling the DCP-Agent to sequentially remove redundant tokens by adapting to dynamic contexts and retaining crucial content. We develop a reward function for training the DCP-Agent that balances the compression rate, the quality of the LLM output, and the retention of key information. This allows for prompt token reduction without needing an external black-box LLM. Inspired by the progressive difficulty adjustment in curriculum learning, we introduce a Hierarchical Prompt Compression (HPC) training strategy that gradually increases the compression difficulty, enabling the DCP-Agent to learn an effective compression method that maintains information integrity. Experiments demonstrate that our method outperforms state-of-the-art techniques, especially at higher compression rates. The code for our approach will be available at https://github.com/Fhujinwu/DCP.
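The abstract describes a reward that balances three terms: compression rate, quality of the LLM output, and retention of key information. The sketch below illustrates one way such a balance could look. It is not the paper's reward function: the name `toy_reward`, the equal default weights, the Jaccard overlap used as a stand-in for output quality, and the key-token set are all assumptions made for illustration.

```python
from typing import Sequence


def toy_reward(
    original: Sequence[str],
    compressed: Sequence[str],
    out_original: Sequence[str],
    out_compressed: Sequence[str],
    key_tokens: Sequence[str],
    w_rate: float = 1.0,     # hypothetical weight on compression
    w_quality: float = 1.0,  # hypothetical weight on output quality
    w_key: float = 1.0,      # hypothetical weight on key-information retention
) -> float:
    """Toy weighted sum of three terms; NOT the reward defined in the paper."""
    if not original:
        raise ValueError("original prompt must contain at least one token")

    # Term 1: fraction of prompt tokens that were removed.
    rate = 1.0 - len(compressed) / len(original)

    # Term 2: stand-in for output quality, measured here as the Jaccard
    # overlap between the outputs produced from the two prompts.
    a, b = set(out_original), set(out_compressed)
    quality = len(a & b) / len(a | b) if (a | b) else 1.0

    # Term 3: fraction of designated key tokens that survive compression.
    keys = set(key_tokens)
    retained = len(keys & set(compressed)) / len(keys) if keys else 1.0

    return w_rate * rate + w_quality * quality + w_key * retained


if __name__ == "__main__":
    score = toy_reward(
        original=["a", "b", "c", "d"],
        compressed=["a", "b"],
        out_original=["x", "y"],
        out_compressed=["x", "y"],
        key_tokens=["a"],
    )
    print(score)
```

In this toy setup, removing more tokens raises the first term, but the total falls as soon as the output changes or a key token is dropped, which is the trade-off the abstract describes.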

Paper Structure

This paper contains 24 sections, 10 equations, 4 figures, 5 tables, 1 algorithm.

Figures (4)

  • Figure 1: Motivation for Prompt Compression of LLMs.
  • Figure 2: Overview of the proposed LLM-DCP. We model prompt compression as a Markov Decision Process (MDP) and train a DCP-Agent to determine an optimal compression pathway. The input prompt, represented as a token sequence, serves as the initial state of the MDP. At time step $t$, the DCP-Agent takes an action that selects which tokens to retain or discard, yielding a compressed token sequence as the next state $s_{t+1}$. The reward is then calculated according to the reward function. Our hierarchical prompt compression (HPC) training strategy collects the trajectory, which is used to train the DCP-Agent. This process iterates until the maximum trajectory length is reached. The final token sequence is decoded into compressed text that has far fewer tokens while preserving output performance as much as possible.
  • Figure 3: Case study on the GSM8K dataset under the 1-shot constraint. Red highlighting marks the words that are preserved; strikethrough marks the words that are removed.
  • Figure 4: Experimental results for different values of $\psi$ on the GSM8K dataset under the 1-shot constraint.
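The Figure 2 caption describes the compression loop in words: the token sequence is the state, the action marks each token as retained or discarded, and the kept tokens form the next state until the maximum trajectory length is reached. A minimal sketch of that loop follows. The names `rollout` and `drop_stopwords_policy` are hypothetical, and the fixed stopword rule only stands in for the trained DCP-Agent, which this sketch does not implement; reward computation and training are omitted.

```python
from typing import Callable, List, Sequence

# A policy maps the current token sequence to one keep/discard flag per token.
Policy = Callable[[Sequence[str]], List[bool]]


def rollout(tokens: Sequence[str], policy: Policy, max_steps: int) -> List[List[str]]:
    """Apply the policy for max_steps steps and return every visited state."""
    state = list(tokens)
    trajectory = [state]  # initial state: the uncompressed prompt
    for _ in range(max_steps):
        keep = policy(state)
        if len(keep) != len(state):
            raise ValueError("policy must return one flag per token")
        # Next state: only the tokens the action marked as retained.
        state = [tok for tok, flag in zip(state, keep) if flag]
        trajectory.append(state)
    return trajectory


def drop_stopwords_policy(state: Sequence[str]) -> List[bool]:
    """Hypothetical stand-in for the trained agent: discard a few stopwords."""
    stopwords = {"the", "a", "of"}
    return [tok.lower() not in stopwords for tok in state]


if __name__ == "__main__":
    states = rollout("the cat sat on a mat".split(), drop_stopwords_policy, max_steps=2)
    # Decode the final state back into text.
    print(" ".join(states[-1]))
```

Keeping every intermediate state, rather than only the final one, mirrors the caption's point that a whole trajectory is collected for training.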