SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models

Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, Tu Vu

TL;DR

SealQA presents a triad of challenging, real-world QA flavors—Seal-0, Seal-Hard, and LongSeal—to stress reasoning under noisy, conflicting web search results in retrieval-augmented LLMs. It demonstrates that frontier models, even with tool-use capabilities and test-time compute, struggle considerably on adversarial questions, and that naive retrieval can worsen performance. The benchmark is built with careful human annotation, rigorous vetting, and an auto-rater reliability study, and is publicly released for ongoing evaluation. Overall, SealQA highlights the need for fundamental advances in robust retrieval, question understanding, and multi-document reasoning in realistic search environments.

Abstract

We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools, such as o3 and o4-mini, achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.
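
To make the "needle-in-a-haystack" setting concrete, the following is a minimal sketch of how one relevant document might be placed among distractors at a controlled position, and how accuracy over graded answers could be computed. It is an illustration only, not the authors' pipeline: the function names `build_haystack` and `accuracy`, the `[Document i]` prompt format, and the boolean grading are all assumptions introduced here.

```python
from typing import List, Tuple


def build_haystack(gold_doc: str, distractors: List[str],
                   gold_position: int) -> Tuple[str, int]:
    """Place the single relevant ("gold") document among distractors.

    gold_position is the 0-based slot for the gold document; varying it
    lets one probe position sensitivity ("lost-in-the-middle").
    The "[Document i]" layout is a hypothetical prompt format.
    """
    if not 0 <= gold_position <= len(distractors):
        raise ValueError("gold_position out of range")
    docs = distractors[:gold_position] + [gold_doc] + distractors[gold_position:]
    context = "\n\n".join(
        f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(docs)
    )
    # Return the 1-based document number of the gold document for later checks.
    return context, gold_position + 1


def accuracy(graded: List[bool]) -> float:
    """Fraction of answers graded correct; 0.0 for an empty list."""
    return sum(graded) / len(graded) if graded else 0.0
```

Sweeping `gold_position` from the first to the last slot while holding the distractors fixed is one simple way to separate position effects from the difficulty of picking out the relevant document.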

Paper Structure

This paper contains 36 sections, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Test-time scaling does not lead to reliable gains on SealQA questions, with performance often plateauing or even declining early.
  • Figure 2: Accuracy of LLMs across benchmarks. SealQA poses significant challenges to frontier models.
  • Figure 3: SealQA requires intensive reasoning to resolve ambiguity, filter out misinformation, or reconcile conflicting evidence.
  • Figure 4: SealQA questions test a broad range of reasoning skills that are often overlooked in existing benchmarks.
  • Figure 5: Advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results.
  • ...and 2 more figures