From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression

Eunseong Choi, Sunkyung Lee, Minjin Choi, June Park, Jongwuk Lee

TL;DR

This paper tackles the problem of long, costly prompts for large language models by introducing Reading To Compressing (R2C), a prompt compression method that leverages Fusion-in-Decoder (FiD) cross-attention to identify globally important content across multiple chunks. It compresses prompts hierarchically, first at the chunk level and then at the sentence level, guided by QA training signals rather than noisy pseudo-labels, and it reuses FiD weights trained for QA. Empirically, R2C reduces prompt length by up to 80% and improves performance on out-of-domain datasets, while also substantially reducing end-to-end latency. The approach generalizes well across in-domain and out-of-domain tasks and offers a practical path toward efficient long-context prompting for multi-document inputs.

Abstract

Large language models (LLMs) have achieved significant performance gains using advanced prompting techniques across various tasks. However, the increasing length of prompts leads to high computational costs and often obscures crucial information. Prompt compression has been proposed to alleviate these issues, but it faces challenges in (i) capturing the global context and (ii) training the compressor effectively. To tackle these challenges, we introduce a novel prompt compression method, namely Reading To Compressing (R2C), utilizing the Fusion-in-Decoder (FiD) architecture to identify the important information in the prompt. Specifically, the cross-attention scores of the FiD are used to discern essential chunks and sentences from the prompt. R2C effectively captures the global context without compromising semantic consistency while avoiding the need for pseudo-labels to train the compressor. Empirical results show that R2C retains key contexts, improving LLM performance by 6% in out-of-domain evaluations while reducing the prompt length by 80%.
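
The chunk-then-sentence selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes per-token cross-attention scores are already available, and the aggregation rules (sum per chunk, mean per sentence), the function names, and the `keep_chunks` and `token_budget` parameters are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# A sentence is a list of (token, attention_score) pairs; a chunk is a list of sentences.
Sentence = List[Tuple[str, float]]
Chunk = List[Sentence]


def sentence_score(sentence: Sentence) -> float:
    """Mean score over the sentence's tokens (assumed aggregation)."""
    return sum(score for _, score in sentence) / len(sentence)


def chunk_score(chunk: Chunk) -> float:
    """Sum of scores over every token in the chunk (assumed aggregation)."""
    return sum(score for sent in chunk for _, score in sent)


def compress(chunks: List[Chunk], keep_chunks: int, token_budget: int) -> List[str]:
    """Keep the top-scoring chunks, then the top-scoring sentences that fit the budget."""
    # Step 1: chunk-level selection by aggregated score.
    ranked = sorted(range(len(chunks)), key=lambda i: chunk_score(chunks[i]), reverse=True)
    kept = sorted(ranked[:keep_chunks])

    # Step 2: sentence-level selection among the surviving chunks.
    candidates = [(ci, si) for ci in kept for si in range(len(chunks[ci]))]
    candidates.sort(key=lambda p: sentence_score(chunks[p[0]][p[1]]), reverse=True)
    chosen, used = [], 0
    for ci, si in candidates:
        length = len(chunks[ci][si])
        if used + length <= token_budget:
            chosen.append((ci, si))
            used += length

    # Emit the selected sentences in their original order.
    return [" ".join(tok for tok, _ in chunks[ci][si]) for ci, si in sorted(chosen)]
```

Selecting whole sentences rather than individual tokens is one way to keep the compressed prompt readable, which matches the abstract's emphasis on preserving semantic consistency.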

Paper Structure (24 sections, 6 equations, 10 figures, 7 tables, 1 algorithm)

This paper contains 24 sections, 6 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: A multi-document reader, i.e., Fusion-in-Decoder (FiD), captures core information by learning to generate answers from lengthy inputs, as highlighted with the dotted red box. The darker the purple color, the higher the cross-attention score in the FiD decoder.
  • Figure 2: The overall framework of Reading To Compressing (R2C).
  • Figure 3: Performance of GPT-3.5 with various compression methods in LongBench. (a): compression effectiveness-efficiency comparison. (b): effectiveness over varying compression ratios (2x--10x).
  • Figure 4: Performance of LLaMA2-7B with R2C on the NQ dev dataset adjusting (a) the hierarchical ratio $\rho$ and (b) importance coefficient $\gamma$.
  • Figure 5: Case study on the Natural Questions development set. The number of tokens is calculated using a ChatGPT tokenizer excluding system messages. The purple colorbox indicates the core information to generate answers.
  • ...and 5 more figures