InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation

Yunjia Xi, Jianghao Lin, Menghui Zhu, Yongzhao Xiao, Zhuoying Ou, Jiaqi Liu, Tong Wan, Bo Chen, Weiwen Liu, Yasheng Wang, Ruiming Tang, Weinan Zhang, Yong Yu

TL;DR

InfoDeepSeek introduces a dynamic benchmark for agentic information seeking within retrieval-augmented generation, addressing the shortcomings of static, fixed-corpus benchmarks. It defines criteria for constructing challenging queries with determinacy, difficulty, and diversity, and presents an Agentic RAG framework with planning, reflection, and multi-tool web exploration. The paper proposes four fine-grained metrics—$ACC$, $IA@k$, $EEU$, and $IC$—and a dual human/LLM evaluation protocol to assess information seeking quality in open web environments. Extensive experiments reveal that current LLMs struggle on agentic tasks, with performance heavily influenced by retrieval quality, search engine choice, and language factors, underscoring the need for improved evidence filtering and test-time compute strategies. Overall, InfoDeepSeek provides a principled, transferable benchmark and evaluation pipeline to advance agentic information seeking in dynamic web settings and points to future automation to scale dataset construction.

Abstract

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding responses with retrieved information. As an emerging paradigm, Agentic RAG further enhances this process by introducing autonomous LLM agents into the information seeking process. However, existing benchmarks fall short in evaluating such systems, as they are confined to a static retrieval environment with a fixed, limited corpus and simple queries that fail to elicit agentic behavior. Moreover, their evaluation protocols assess information seeking effectiveness by pre-defined gold sets of documents, making them unsuitable for the open-ended and dynamic nature of real-world web environments. To bridge this gap, we present InfoDeepSeek, a new benchmark with challenging questions designed for assessing agentic information seeking in real-world, dynamic web environments. We propose a systematic methodology for constructing challenging queries satisfying the criteria of determinacy, difficulty, and diversity. Based on this, we develop the first evaluation framework tailored to dynamic agentic information seeking, including fine-grained metrics about the accuracy, utility, and compactness of information seeking outcomes. Through extensive experiments across LLMs, search engines, and question types, InfoDeepSeek reveals nuanced agent behaviors and offers actionable insights for future research.

Paper Structure

This paper contains 41 sections, 5 equations, 6 figures, 18 tables.

Figures (6)

  • Figure 1: Comparison between traditional RAG benchmarks (top) and our InfoDeepSeek (bottom).
  • Figure 2: The construction workflow of InfoDeepSeek dataset.
  • Figure 3: Performance of LLMs and search engines across different question attributes.
  • Figure 4: Performance with different maximum numbers of steps $T$ of information seeking.
  • Figure 5: Retrieval interference (a) and the impact of languages (b).
  • ...and 1 more figures