Evaluation of retrieval-based QA on QUEST-LOFT

Nathan Scales, Nathanael Schärli, Olivier Bousquet

TL;DR

This work analyzes why substantial performance headroom persists on QUEST-LOFT for retrieval-based QA with long-context LLMs, and demonstrates that carefully engineered retrieval with structured reasoning outputs can significantly outperform long-context baselines. It introduces QUEST-LOFT-128K-Revised and Simple28, accompanied by a modular evaluation framework that combines retrieval strategies (CiC, RaR, RAG) with justification and optional verification. Key findings are that RAG combined with Justified QA yields substantial gains (e.g., +0.14 F1 over CiC with Pro), that verification adds minor but consistent gains, that zero-shot prompts can outperform few-shot baselines, and that model choice (Pro vs. Flash) modulates the effectiveness of chain-of-thought and QUEST-specific instructions. The results reinforce the continued value of retrieval-based approaches for complex multi-document QA and outline future directions, including broader benchmarks, higher-quality curated data, and cost-aware scaling of retrieval to larger corpora.

Abstract

Despite the popularity of retrieval-augmented generation (RAG) as a solution for grounded QA in both academia and industry, current RAG methods struggle with questions where the necessary information is distributed across many documents or where retrieval needs to be combined with complex reasoning. Recently, the LOFT study has shown that this limitation also applies to approaches based on long-context language models, with the QUEST benchmark exhibiting particularly large headroom. In this paper, we provide an in-depth analysis of the factors contributing to the poor performance on QUEST-LOFT, publish updated numbers based on a thorough human evaluation, and demonstrate that RAG can be optimized to significantly outperform long-context approaches when combined with a structured output format containing reasoning and evidence, optionally followed by answer re-verification.

Paper Structure

This paper contains 44 sections, 3 figures, and 11 tables.

Figures (3)

  • Figure 1: QUEST dataset example
  • Figure 2: Example of LLM output for the baseline QA strategies.
  • Figure 3: Example of LLM output for a Justified QA strategy.