iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics

Preetam Prabhu Srikar Dammu; Arnav Palkhiwala; Tanya Roosta; Chirag Shah

TL;DR

iAgentBench is a dynamic open-domain question answering (ODQA) benchmark that targets higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis.

Abstract

With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.

Paper Structure (29 sections, 6 equations, 3 figures, 2 tables)

Introduction
Related Work
Method
Interest-Driven Seeds
Seed candidates from GDELT.
Scoring and selection.
Graph Construction
Graph definition.
Community detection and reports.
Community Roles and Influence
Community meta-graph.
Influence score.
Core, Bridge, and Satellite themes.
Benchmark Instance Construction
Connector Relations
...and 14 more sections

Figures (3)

Figure 1: Overview of the iAgentBench construction pipeline. We (1) sample time-indexed, high-traffic seed queries from public attention signals (GDELT), (2) retrieve a query-conditioned web corpus and extract a claim-like story graph with thematic communities, (3) assign community roles (Core/Bridge/Satellite) and build compact artifacts (community cards, connectors, packets) that preserve cross-theme links, and (4) generate and filter standalone ODQA pairs using a panel of LLM judges.
Figure 2: Base vs. RAG accuracy across datasets and models. Points above the diagonal indicate gains from evidence access.
Figure 3: Retrieval gains vs. agentic gains, decomposing improvements into $\Delta_{\mathrm{RAG}}=\mathrm{Acc}(\mathrm{RAG})-\mathrm{Acc}(\mathrm{Base})$ and $\Delta_{\mathrm{Refl}}=\mathrm{Acc}(\mathrm{Refl})-\mathrm{Acc}(\mathrm{RAG})$. Positive $\Delta_{\mathrm{Refl}}$ means iteration helps beyond RAG, while negative values indicate regressions.
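The decomposition in the Figure 3 caption can be illustrated with a minimal sketch. The function name `decompose_gains` and the accuracy values below are hypothetical and chosen only for illustration; they are not results from the paper. The sketch assumes accuracies are given as fractions in [0, 1] for the three settings named in the caption (Base, RAG, Refl).

```python
from typing import NamedTuple


class GainDecomposition(NamedTuple):
    """Accuracy gains as defined in the Figure 3 caption."""
    delta_rag: float   # Acc(RAG) - Acc(Base): gain from evidence access
    delta_refl: float  # Acc(Refl) - Acc(RAG): gain from iteration beyond RAG


def decompose_gains(acc_base: float, acc_rag: float, acc_refl: float) -> GainDecomposition:
    """Split the total improvement over the base model into two parts.

    The two deltas sum to Acc(Refl) - Acc(Base) by construction.
    """
    return GainDecomposition(
        delta_rag=acc_rag - acc_base,
        delta_refl=acc_refl - acc_rag,
    )


# Hypothetical accuracies, NOT taken from the paper.
gains = decompose_gains(acc_base=0.5, acc_rag=0.75, acc_refl=0.625)

# A negative delta_refl corresponds to a regression relative to RAG.
is_regression = gains.delta_refl < 0
```

With these illustrative inputs, retrieval helps while iteration lowers accuracy relative to RAG, which is the regression case the caption describes.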
