AdaComp: Extractive Context Compression with Adaptive Predictor for Retrieval-Augmented Large Language Models

Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng

TL;DR

AdaComp is a low-cost extractive context compression method that adaptively determines the compression rate from both query complexity and retrieval quality, significantly reducing inference costs while maintaining performance nearly identical to uncompressed models.

Abstract

Retrieved documents containing noise will hinder RAG from detecting answer clues and make the inference process slow and expensive. Therefore, context compression is necessary to enhance its accuracy and efficiency. Existing context compression methods use extractive or generative models to retain the most query-relevant sentences or apply the information bottleneck theory to preserve sufficient information. However, these methods may face issues such as over-compression or high computational costs. We observe that the retriever often ranks relevant documents at the top, but the exact number of documents needed to answer the query is uncertain due to the impact of query complexity and retrieval quality: complex queries such as multi-hop questions may require retaining more documents than simpler queries, and a low-quality retrieval may need to rely on more documents to generate accurate outputs. Therefore, determining the minimum number of required documents (the compression rate) is still a challenge for RAG. In this paper, we introduce AdaComp, a low-cost extractive context compression method that adaptively determines the compression rate based on both query complexity and retrieval quality. Specifically, we first annotate the minimum number of top-k documents necessary for the RAG system to answer the current query as the compression rate, and then construct triplets of the query, the retrieved documents, and the compression rate. We then use this triplet dataset to train a compression-rate predictor. Experiments on three QA datasets and one conversational multi-doc QA dataset show that AdaComp significantly reduces inference costs while maintaining performance nearly identical to uncompressed models, achieving a balance between efficiency and performance.
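
The annotation step described in the abstract (find the smallest top-k prefix of the ranked documents that still yields a correct answer, then store it with the query and documents as a triplet) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the names `annotate_min_k`, `build_triplets`, and `toy_answer` are hypothetical, correctness is approximated by a substring match on the gold answer, and skipping queries that no prefix answers correctly is an assumption of this sketch.

```python
from typing import Callable, List, Optional, Tuple

# (query, retrieved documents, compression rate k)
Triplet = Tuple[str, List[str], int]


def annotate_min_k(
    query: str,
    ranked_docs: List[str],
    answer_fn: Callable[[str, List[str]], str],
    is_correct: Callable[[str], bool],
) -> Optional[int]:
    """Return the smallest k whose top-k documents give a correct answer.

    Returns None when no prefix of the ranking works (an assumption of
    this sketch, not a rule taken from the paper).
    """
    for k in range(1, len(ranked_docs) + 1):
        if is_correct(answer_fn(query, ranked_docs[:k])):
            return k
    return None


def build_triplets(examples, answer_fn) -> List[Triplet]:
    """Turn (query, ranked_docs, gold_answer) examples into training triplets."""
    triplets: List[Triplet] = []
    for query, docs, gold in examples:
        # Hypothetical correctness check: gold answer appears in the output.
        judge = lambda out, g=gold: g.lower() in out.lower()
        k = annotate_min_k(query, docs, answer_fn, judge)
        if k is not None:
            triplets.append((query, docs, k))
    return triplets


# Toy stand-in for the generator: it "answers" only if the context mentions Paris.
def toy_answer(query: str, context: List[str]) -> str:
    return "Paris" if any("Paris" in doc for doc in context) else "unknown"


examples = [(
    "What is the capital of France?",
    ["France is in Europe.", "Paris is the capital of France.", "Unrelated text."],
    "Paris",
)]
triplets = build_triplets(examples, toy_answer)
print(triplets[0][2])  # annotated compression rate for the toy query
```

In a real setting, `answer_fn` would call the RAG generator and the resulting triplets would serve as supervised data for the compression-rate predictor.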

Paper Structure (20 sections, 7 equations, 5 figures, 3 tables, 2 algorithms)

This paper contains 20 sections, 7 equations, 5 figures, 3 tables, 2 algorithms.

Figures (5)

  • Figure 1: An illustration of how retrieval quality affects the generation results of context compression models. TOP-1 and RECOMP select the most query-relevant sentences, but they produce incorrect answers due to over-compression. ONLY_Doc4 selects the $4^{th}$ document as context but cannot answer correctly because it lacks background knowledge about the query. Although TOP-5 can answer correctly, document 5 is irrelevant and should be filtered out.
  • Figure 2: Overall architecture of AdaComp, which includes a retriever module $R$, a compression module $C_\theta$, and a generation module $G$.
  • Figure 3: An illustration of how the number of documents affects final RAG performance. In general, performance initially improves as the number of documents increases, because sufficient information is provided. However, when the number of documents increases excessively, the large amount of added noise leads to a decline in RAG performance.
  • Figure 4: Confusion Matrix for Predictor Performance
  • Figure 5: Case Study: answers generated without RAG and with the Top-1 document, RECOMP, FILCO, and AdaComp.
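
The three-module layout named in the Figure 2 caption (retriever $R$, compression module $C_\theta$, generator $G$) suggests an inference flow along the following lines. This is a hedged sketch only: `adaptive_rag`, `retrieve`, `predict_k`, and `generate` are hypothetical names, the stub lambdas stand in for real models, and clamping the predicted value to the number of retrieved documents is an assumption of this sketch.

```python
from typing import Callable, List, Tuple


def adaptive_rag(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    predict_k: Callable[[str, List[str]], int],
    generate: Callable[[str, List[str]], str],
    n_docs: int = 5,
) -> Tuple[str, int]:
    """Retrieve n_docs documents, keep only the predicted top-k, then generate."""
    docs = retrieve(query, n_docs)          # R: ranked documents
    k = predict_k(query, docs)              # C_theta: predicted compression rate
    k = max(0, min(k, len(docs)))           # clamp to a valid prefix length
    return generate(query, docs[:k]), k     # G: answer from the compressed context


# Stub components for illustration only.
corpus = ["doc A", "doc B", "doc C", "doc D", "doc E"]
answer, k = adaptive_rag(
    "example query",
    retrieve=lambda q, n: corpus[:n],
    predict_k=lambda q, docs: 2,
    generate=lambda q, ctx: f"answer from {len(ctx)} docs",
)
print(answer, k)
```

Because compression here only truncates the ranked list, the extra inference cost is a single predictor call, while the generator receives a shorter prompt whenever the predicted k is below `n_docs`.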