Estimating Optimal Context Length for Hybrid Retrieval-augmented Multi-document Summarization
Adithya Pratapa, Teruko Mitamura
TL;DR
This work addresses how to determine the optimal retrieval context length for retrieval-augmented multi-document summarization. It introduces a hybrid approach that couples retrieval with the long context windows of recent LMs. On a sampled subset of the dataset, a panel of LMs generates candidate summaries, Minimum Bayes Risk decoding selects the top candidates as silver references, and a search over a wide range of retrieval lengths (from $8K$ to $80K$ tokens) finds the length whose summaries best align with those silver references. On the SummHay dataset, the proposed method outperforms full-context baselines and context-length estimates derived from long-context benchmarks (RULER, HELMET) across model classes and sizes, including very long-context LMs, and it generalizes to LM classes outside the panel (e.g., Phi-3). The approach offers practical gains in both performance and efficiency by tailoring a shorter effective context window to the retriever, summarizer, and dataset, with potential extensions to open-domain QA and per-example customization.
Abstract
Recent advances in the long-context reasoning abilities of language models have led to interesting applications in large-scale multi-document summarization. However, prior work has shown that these long-context models are not effective at their claimed context windows. Retrieval-augmented systems provide an efficient and effective alternative, but their performance can be highly sensitive to the choice of retrieval context length. In this work, we present a hybrid method that combines retrieval-augmented systems with the long context windows supported by recent language models. Our method first estimates the optimal retrieval length as a function of the retriever, summarizer, and dataset. On a randomly sampled subset of the dataset, we use a panel of LLMs to generate a pool of silver references. We then use these silver references to estimate the optimal context length for a given RAG system configuration. Our results on the multi-document summarization task demonstrate the effectiveness of our method across model classes and sizes. We compare against length estimates from strong long-context benchmarks such as RULER and HELMET. Our analysis also highlights the effectiveness of our estimation method for very long-context LMs and its generalization to new classes of LMs.
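The estimation procedure summarized above (pool candidate summaries from an LM panel, keep the top candidates by Minimum Bayes Risk, then pick the retrieval length whose summaries best match those silver references) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `token_f1`, `mbr_select`, and `estimate_context_length` are hypothetical, and the unigram-overlap F1 used as the utility is an assumption that stands in for whatever summary-similarity metric the actual system uses. The sketch also assumes that candidate summaries and per-length system summaries have already been generated elsewhere.

```python
from collections import Counter
from typing import Callable, Dict, List

Utility = Callable[[str, str], float]


def token_f1(hyp: str, ref: str) -> float:
    """Unigram-overlap F1; an assumed stand-in for a real summary metric."""
    h, r = Counter(hyp.lower().split()), Counter(ref.lower().split())
    overlap = sum((h & r).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates: List[str], k: int = 1,
               utility: Utility = token_f1) -> List[str]:
    """Return the k candidates with the highest mean utility against
    the other candidates (Minimum Bayes Risk selection)."""
    if len(candidates) <= 1:
        return list(candidates)
    scores = []
    for i, cand in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        scores.append(sum(utility(cand, o) for o in others) / len(others))
    ranked = sorted(range(len(candidates)), key=lambda i: scores[i],
                    reverse=True)
    return [candidates[i] for i in ranked[:k]]


def estimate_context_length(summaries_by_length: Dict[int, List[str]],
                            silver_refs: List[List[str]],
                            utility: Utility = token_f1) -> int:
    """Pick the retrieval length whose summaries best match the silver
    references on the sampled subset.

    summaries_by_length[L][n]: system summary for example n at length L.
    silver_refs[n]: silver references (e.g. from mbr_select) for example n.
    """
    def mean_score(length: int) -> float:
        per_example = [
            sum(utility(summary, ref) for ref in refs) / len(refs)
            for summary, refs in zip(summaries_by_length[length], silver_refs)
        ]
        return sum(per_example) / len(per_example)

    # Iterating lengths in ascending order means ties go to the shorter
    # (cheaper) context, since max() keeps the first maximal element.
    return max(sorted(summaries_by_length), key=mean_score)
```

In use, `mbr_select` would be applied per example to the panel's candidate pool to form `silver_refs`, and `estimate_context_length` would then be called once per retriever, summarizer, and dataset configuration, with keys such as 8000 through 80000 tokens.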
