Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization

Adithya Pratapa, Teruko Mitamura

TL;DR

This work addresses how to choose the retrieval context length for retrieval-augmented multi-document summarization. It introduces a hybrid approach that pairs retrieval with the long context windows of recent language models. On a sampled subset of the dataset, a panel of LLMs generates candidate summaries, and the top candidates are selected as silver references via Minimum Bayes Risk decoding. The method then searches a range of retrieval lengths (from $8K$ to $80K$ tokens) for the length whose summaries best align with the silver references. On the SummHay dataset, the estimated lengths outperform full-context baselines and length estimates derived from long-context benchmarks (RULER, HELMET) across model classes and sizes, including very long-context LMs, and the method generalizes to LM classes outside the panel (e.g., Phi-3). The approach improves both performance and efficiency by using shorter effective context windows tailored to the retriever, summarizer, and dataset, with potential extensions to open-domain QA and per-example length selection.
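
As a rough formalization of the two steps described above (the notation here is an assumption introduced for illustration, not taken from the paper): let $\mathcal{C}$ be the pool of panel candidates for an example, $u$ a summary similarity metric, $\mathcal{S}$ the sampled subset, $\mathcal{R}_x$ the silver references kept for example $x$, and $\mathrm{summ}_k(x)$ the system summary produced with retrieval length $k$. Minimum Bayes Risk decoding, in its common form, scores each candidate by its average agreement with the other candidates, and the retrieval length can then be chosen to maximize average agreement with the silver references:

$$\hat{y} = \arg\max_{y \in \mathcal{C}} \frac{1}{|\mathcal{C}| - 1} \sum_{y' \in \mathcal{C} \setminus \{y\}} u(y, y'), \qquad k^{*} = \arg\max_{k \in [8K,\, 80K]} \frac{1}{|\mathcal{S}|} \sum_{x \in \mathcal{S}} \frac{1}{|\mathcal{R}_x|} \sum_{r \in \mathcal{R}_x} u\big(\mathrm{summ}_k(x),\, r\big)$$

The exact utility, the number of references kept, and the aggregation used by the authors may differ from this sketch.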

Abstract

Recent advances in the long-context reasoning abilities of language models have led to interesting applications in large-scale multi-document summarization. However, prior work has shown that these long-context models are not effective at their claimed context windows. Retrieval-augmented systems provide an efficient and effective alternative, but their performance can be highly sensitive to the choice of retrieval context length. In this work, we present a hybrid method that combines retrieval-augmented systems with the long context windows supported by recent language models. Our method first estimates the optimal retrieval length as a function of the retriever, summarizer, and dataset. On a randomly sampled subset of the dataset, we use a panel of LLMs to generate a pool of silver references. We use these silver references to estimate the optimal context length for a given RAG system configuration. Our results on the multi-document summarization task showcase the effectiveness of our method across model classes and sizes. We compare against length estimates from strong long-context benchmarks such as RULER and HELMET. Our analysis also highlights the effectiveness of our estimation method for very long-context LMs and its generalization to new classes of LMs.

Paper Structure

This paper contains 25 sections, 1 figure, 8 tables.

Figures (1)

  • Figure 1: A schematic overview of our proposed method. Unlike traditional benchmarks, we estimate the optimal retrieval length as a function of dataset, retriever and summarizer. Given a dataset, we first sample a fraction of examples. On this subset, we run a panel of LLMs in a full-context setup to create silver candidates. We then identify the top silver candidates using Minimum Bayes Risk decoding. With the help of these silver candidates, we estimate the optimal retrieval length for the given experiment config.
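
The pipeline in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `token_f1` is a stand-in utility (the paper's actual similarity metric is not given here), and `mbr_select` and `estimate_length` are hypothetical helper names. The sketch assumes the panel candidates and the per-length system summaries have already been generated.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def token_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1. A stand-in utility, NOT the paper's metric."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(
    candidates: Sequence[str],
    top_k: int = 1,
    utility: Callable[[str, str], float] = token_f1,
) -> List[str]:
    """Rank candidates by mean utility against the other candidates; keep top_k."""

    def expected_utility(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            return 0.0
        return sum(utility(candidates[i], o) for o in others) / len(others)

    ranked = sorted(range(len(candidates)), key=expected_utility, reverse=True)
    return [candidates[i] for i in ranked[:top_k]]


def estimate_length(
    system_summaries: Dict[int, List[str]],
    silver_refs: List[List[str]],
    utility: Callable[[str, str], float] = token_f1,
) -> int:
    """Pick the retrieval length whose summaries best match the silver references.

    system_summaries[length][i] is the RAG summary for sampled example i at
    that retrieval length; silver_refs[i] holds the silver references for
    example i. Ties are broken in favor of the shorter length.
    """

    def score(length: int) -> float:
        per_example = [
            sum(utility(s, r) for r in refs) / len(refs)
            for s, refs in zip(system_summaries[length], silver_refs)
        ]
        return sum(per_example) / len(per_example)

    return max(sorted(system_summaries), key=score)


# Toy usage: one sampled example, three panel candidates, two retrieval lengths.
panel = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "stock prices fell sharply",
]
silver = mbr_select(panel, top_k=2)
best = estimate_length(
    {8000: ["stock prices fell"], 16000: ["the cat sat on a mat"]},
    [silver],
)
```

In the toy run, the outlier candidate is dropped by the MBR step because it agrees with neither of the other two, and the length whose summary overlaps the silver references is selected. A real setup would swap in a stronger summary-level metric and a grid of lengths such as the $8K$ to $80K$ range mentioned above.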