Dynamic Compressing Prompts for Efficient Inference of Large Language Models

Jinwu Hu; Wei Zhang; Yufeng Wang; Yu Hu; Bin Xiao; Mingkui Tan; Qing Du

TL;DR

This paper addresses the cost and efficiency problems of prompting large language models by introducing Dynamic Compressing Prompts (LLM-DCP). It frames prompt compression as a Markov Decision Process and trains the compression agent with a Hierarchical Prompt Compression (HPC) strategy, so that it learns compact yet information-preserving prompts without relying on rewards from a black-box LLM. A new reward function balances compression rate, output fidelity, and retention of key content, while distribution alignment via instruction tuning enables training without external LLM feedback. The approach outperforms state-of-the-art baselines, particularly at higher compression rates, and generalizes across tasks. The result is a task-agnostic framework for efficient LLM inference aimed at cost-sensitive deployments.

Abstract

Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. However, these techniques often require lengthy prompts, which increase computational costs and can hinder performance because of the limited context windows of LLMs. While prompt compression is a straightforward solution, existing methods confront the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. To tackle these issues, we propose a task-agnostic method called Dynamic Compressing Prompts (LLM-DCP). Our method reduces the number of prompt tokens while aiming to preserve the performance as much as possible. We model prompt compression as a Markov Decision Process (MDP), enabling the DCP-Agent to sequentially remove redundant tokens by adapting to dynamic contexts and retaining crucial content. We develop a reward function for training the DCP-Agent that balances the compression rate, the quality of the LLM output, and the retention of key information. This allows for prompt token reduction without needing an external black-box LLM. Inspired by the progressive difficulty adjustment in curriculum learning, we introduce a Hierarchical Prompt Compression (HPC) training strategy that gradually increases the compression difficulty, enabling the DCP-Agent to learn an effective compression method that maintains information integrity. Experiments demonstrate that our method outperforms state-of-the-art techniques, especially at higher compression rates. The code for our approach will be available at https://github.com/Fhujinwu/DCP.
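To make the three-way balance in the reward concrete, here is a minimal sketch in Python. It is not the paper's reward formula: the weighted-sum form, the weights `alpha`, `beta`, `gamma`, the helper names, and the caller-supplied `fidelity` score in [0, 1] are all assumptions made for illustration. One step of the MDP is modeled as applying a keep/drop mask to the current token list.

```python
from typing import List, Set


def apply_action(tokens: List[str], keep_mask: List[bool]) -> List[str]:
    """One illustrative MDP step: keep the tokens whose mask entry is True."""
    if len(tokens) != len(keep_mask):
        raise ValueError("keep_mask must have one entry per token")
    return [tok for tok, keep in zip(tokens, keep_mask) if keep]


def compression_reward(
    original: List[str],
    compressed: List[str],
    key_tokens: Set[str],
    fidelity: float,
    alpha: float = 1.0,  # hypothetical weight on compression
    beta: float = 1.0,   # hypothetical weight on output quality
    gamma: float = 1.0,  # hypothetical weight on key-information retention
) -> float:
    """Toy reward: weighted sum of removed fraction, fidelity, and retention.

    `fidelity` stands in for whatever output-quality measure is used;
    here the caller simply supplies a number in [0, 1].
    """
    if not original:
        raise ValueError("original prompt must not be empty")
    removed_fraction = 1.0 - len(compressed) / len(original)
    # Only key tokens that actually occur in the original can be retained.
    present = key_tokens & set(original)
    retention = len(present & set(compressed)) / len(present) if present else 1.0
    return alpha * removed_fraction + beta * fidelity + gamma * retention


# Example: drop two filler tokens while keeping every key token.
prompt = ["please", "kindly", "sum", "2", "and", "3"]
mask = [False, False, True, True, True, True]
shorter = apply_action(prompt, mask)
reward = compression_reward(prompt, shorter, {"sum", "2", "3"}, fidelity=1.0)
```

In this toy setup, removing a key token lowers the retention term, so the agent is only rewarded for compression that leaves the essential content in place. That is the trade-off the abstract describes, although the actual terms and weights are defined in the paper itself.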
