ACoRN: Noise-Robust Abstractive Compression in Retrieval-Augmented Language Models

Singon Kim, Gunho Jung, Seong-Whan Lee

TL;DR

ACoRN tackles the challenge of noisy, potentially misleading retrieved documents in retrieval-augmented language models by introducing a noise-robust abstractive compressor. It couples offline data augmentation to simulate retrieval noise with a finetuning strategy that centers summaries on evidential content containing the answer, mitigating the 'lost in the middle' problem. The approach achieves higher EM/F1 and preserves answer strings across NQ, TriviaQA, and PopQA while reducing inference time, particularly in noisy settings. These results demonstrate practical gains for real-world OCQA systems and provide benchmarks and methods for robust summarization under retrieval noise.

Abstract

Abstractive compression utilizes smaller language models to condense query-relevant context, reducing computational costs in retrieval-augmented generation (RAG). However, retrieved documents often include information that is either irrelevant to answering the query or misleading due to factually incorrect content, despite having high relevance scores. This behavior indicates that abstractive compressors are more likely to omit important information essential for the correct answer, especially in long contexts where attention dispersion occurs. To address this issue, we categorize retrieved documents in a more fine-grained manner and propose Abstractive Compression Robust against Noise (ACoRN), which introduces two novel training steps. First, we use offline data augmentation on the training dataset to enhance compressor robustness against two distinct types of retrieval noise. Second, since the language-model-based compressor cannot fully utilize information from multiple retrieved documents and exhibits positional bias, we perform finetuning to generate summaries centered around key information that directly supports the correct answer. Our experiments demonstrate that T5-large, trained with ACoRN as a compressor, improves EM and F1 scores while preserving the answer string, which could serve as direct evidence. ACoRN excels on datasets with many accuracy-reducing documents, making it highly useful in real-world scenarios.
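The abstract reports EM, F1, and answer-string preservation without defining them here. The sketch below shows one common way such scores are computed for open-domain QA, assuming SQuAD-style normalization (lowercasing, stripping punctuation and English articles). The function names `normalize`, `exact_match`, `f1_score`, and `preserves_answer` are hypothetical and are not taken from the paper, whose exact evaluation script may differ.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between a prediction and a single gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def preserves_answer(summary: str, gold_answers: list[str]) -> bool:
    """True if any gold answer appears as a whole-token span in the summary.

    Assumption: "preserving the answer string" is approximated here by a
    normalized containment check on the compressed summary.
    """
    padded = f" {normalize(summary)} "
    return any(f" {normalize(gold)} " in padded for gold in gold_answers)
```

In this setup, `exact_match` and `f1_score` would be applied to the answer produced by the downstream language model, while `preserves_answer` would be applied to the compressor's summary itself, since a summary that drops the answer string cannot serve as direct evidence.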

Paper Structure

This paper contains 22 sections, 9 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: An illustrative example of the challenge of retrieving and summarizing information that supports the correct answer from the documents. The compressor performs well in summarizing content that supports the correct answer when only the document including the correct answer is provided. However, it generates incorrect information or misses the key information when the retrieved documents contain inaccurate or irrelevant information.
  • Figure 2: Overview of Abstractive Compression Robust against Noise (ACoRN). We fine-tune a compressor on our curated training dataset to make it robust against noisy documents and to retrieve evidential documents, focusing on summarizing the query based on the content of the evidential documents.
  • Figure 3: Exact Match (EM) scores for different types of noise documents, including irrelevant documents and factual error documents. Flan-T5-large compresses documents using Query-Focused Summarization (QFS); the compressed passages are then passed to LLaMA-3.1-8B-Instruct to generate answers to the queries.
  • Figure 4: Comparison of GPT-3.5-turbo QFS performance when only evidential documents are included in the prompt versus when all top-5 documents are included, based on a random sample of 100 cases for each evidential document count $N$ in top-5 retrieval. When $N=0$, the evidential-only setting has no documents to include, so the model relies only on its internal knowledge. The compressed output is passed to the inference model's prompt, with the language model $M$ being LLaMA-3.1-8B-Instruct. The dotted line represents the performance when summarization is done on 100 randomly sampled instances, regardless of $N$.