BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression

Yuankai Li, Jia-Chen Gu, Di Wu, Kai-Wei Chang, Nanyun Peng

TL;DR

BRIEF addresses the latency and long-context degradation in retrieval-augmented generation for multi-hop QA by introducing a lightweight compressor that converts retrieved documents into dense, query-focused textual summaries. It trains the compressor on a fully open-source synthetic data pipeline that emphasizes multi-hop reasoning and proposition-level evidence, enabling a range of LLMs to perform multi-hop reasoning with compressed context. Across HotpotQA, NQ, TriviaQA, and MuSiQue, BRIEF achieves substantially higher compression (up to ~19x) while maintaining competitive EM and F1 scores, outperforming strong baselines like RECOMP and approaching proprietary GPT-3.5 in certain settings. The approach yields notable reductions in computation (GFLOPs) and shows good transferability across LMs, with scalability to longer documents providing a path for further improvements and broader applicability in real-world, latency-constrained QA tasks.

Abstract

Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. However, as the number of retrieved documents increases, the input length to LLMs grows linearly, causing a dramatic increase in latency and a degradation in long-context understanding. This is particularly serious for multi-hop questions that require a chain of reasoning across documents. To accelerate inference, reduce costs, and minimize distractions, this paper presents BRIEF (Bridging Retrieval and Inference through Evidence Fusion), a lightweight approach that performs query-aware multi-hop reasoning by compressing retrieved documents into highly dense textual summaries to integrate into in-context RAG. To enable learning compression for multi-hop reasoning, we curate synthetic data by extracting atomic propositions that encapsulate distinct factoids from the source documents to compose synthetic summaries. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries and enables a range of LLMs to achieve exceptional open-domain question answering (QA) performance. For example, on HotpotQA, BRIEF improves the compression rate by 2 times compared to the state-of-the-art baseline, while outperforming it by 3.00% EM and 4.16% F1 with Flan-UL2 as the reader model. It also generates more concise summaries than proprietary GPT-3.5, while demonstrating nearly identical QA performance.
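
To make the compression figures above concrete, the sketch below computes a compression rate and assembles a reader prompt from a compressed summary. It is a minimal illustration, not the paper's implementation. It assumes the compression rate is the ratio of whitespace-delimited words in the retrieved documents to words in the summary (the abstract does not define the metric), and the helper names `compression_rate` and `build_prompt`, as well as the prompt template, are hypothetical.

```python
from __future__ import annotations


def word_count(text: str) -> int:
    """Count whitespace-delimited words (an assumed proxy for length)."""
    return len(text.split())


def compression_rate(documents: list[str], summary: str) -> float:
    """Ratio of total document length to summary length.

    Assumption: length is measured in whitespace-delimited words; a
    tokenizer-based count could be swapped in instead.
    """
    doc_words = sum(word_count(d) for d in documents)
    summary_words = word_count(summary)
    if summary_words == 0:
        raise ValueError("summary is empty; compression rate is undefined")
    return doc_words / summary_words


def build_prompt(summary: str, question: str) -> str:
    """Prepend the compressed summary to the question for the reader LM.

    The template is hypothetical; any in-context RAG format would do.
    """
    return f"{summary}\n\nQuestion: {question}\nAnswer:"


if __name__ == "__main__":
    docs = ["a b c d e f", "g h i j"]  # 10 words in total (toy data)
    summary = "a g"                    # 2 words
    print(compression_rate(docs, summary))
    print(build_prompt(summary, "Which letters matter?"))
```

Under this definition, a higher rate means the reader LM receives fewer input words for the same retrieved evidence, which is where the latency and compute savings come from.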

Paper Structure

This paper contains 35 sections, 1 equation, 11 figures, 4 tables.

Figures (11)

  • Figure 1: A comparison between BRIEF and previous methods. The retrieved documents are compressed into a highly dense textual summary relevant to the query before prepending it as input to an LM. LLMLingua (Jiang et al., 2023) struggles to produce fluent natural language due to its token-level compression. RECOMP (Xu et al., 2024) is limited to collecting evidence in a single logical step, yet it still produces lengthy summaries.
  • Figure 2: An overview of the synthetic data pipeline for training BRIEF. Starting with a seed single-hop question, the pipeline can generate a multi-hop (question, documents, summary) tuple to enhance the awareness of multi-hop reasoning and compression. Meanwhile, it can also generate a single-hop tuple through a simplified process by bypassing the Multi-hop Question Composition and Multi-hop Validation modules.
  • Figure 3: The transfer ability of compressed summaries across LMs. We selected models from the same family to avoid model selection bias.
  • Figure 4: The length change of compressed summaries with respect to the multi-hop nature of questions.
  • Figure 5: The comparison of GFLOPs consumption when processing the top-5 documents with or without compression, using Flan-UL2 as the LM.
  • ...and 6 more figures