iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics

Preetam Prabhu Srikar Dammu, Arnav Palkhiwala, Tanya Roosta, Chirag Shah

TL;DR

iAgentBench is a dynamic open-domain question answering (ODQA) benchmark that targets higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis.

Abstract

With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.

Paper Structure

This paper contains 29 sections, 6 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of the iAgentBench construction pipeline. We (1) sample time-indexed, high-traffic seed queries from public attention signals (GDELT), (2) retrieve a query-conditioned web corpus and extract a claim-like story graph with thematic communities, (3) assign community roles (Core/Bridge/Satellite) and build compact artifacts (community cards, connectors, packets) that preserve cross-theme links, and (4) generate and filter standalone ODQA pairs using a panel of LLM judges.
  • Figure 2: Base vs. RAG accuracy across datasets and models. Points above the diagonal indicate gains from evidence access.
  • Figure 3: Retrieval gains vs. agentic gains, decomposing improvements into $\Delta_{\mathrm{RAG}}=\mathrm{Acc}(\mathrm{RAG})-\mathrm{Acc}(\mathrm{Base})$ and $\Delta_{\mathrm{Refl}}=\mathrm{Acc}(\mathrm{Refl})-\mathrm{Acc}(\mathrm{RAG})$. Positive $\Delta_{\mathrm{Refl}}$ means iteration helps beyond RAG, while negative values indicate regressions.
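The decomposition in the Figure 3 caption can be illustrated with a minimal sketch. The accuracy values, the `Accuracies` container, and the `decompose_gains` helper are hypothetical and chosen only for illustration; they are not taken from the paper's results or code. "Refl" is treated here simply as the iterative setting the caption refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accuracies:
    """Accuracy of one model on one dataset under three settings (hypothetical container)."""
    base: float  # no retrieval
    rag: float   # with retrieval
    refl: float  # iterative "Refl" setting from the caption


def decompose_gains(acc: Accuracies) -> tuple[float, float]:
    """Return (delta_rag, delta_refl) as defined in the Figure 3 caption."""
    delta_rag = acc.rag - acc.base    # Acc(RAG) - Acc(Base)
    delta_refl = acc.refl - acc.rag   # Acc(Refl) - Acc(RAG)
    return delta_rag, delta_refl


# Made-up numbers for illustration only, not results from the paper.
example = Accuracies(base=0.40, rag=0.55, refl=0.50)
d_rag, d_refl = decompose_gains(example)

# A negative delta_refl corresponds to a regression relative to RAG.
label = "helps beyond RAG" if d_refl > 0 else "no gain or regression"
print(f"delta_RAG={d_rag:+.2f}, delta_Refl={d_refl:+.2f} ({label})")
```

By construction the two terms sum to the total change from Base to Refl, so a positive retrieval gain can be partly cancelled by a negative iteration term, as in the made-up numbers above.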