InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents

Yaxin Du, Yuanshuo Zhang, Xiyuan Yang, Yifan Zhou, Cheng Wang, Gongyi Zou, Xianghe Pang, Wenhao Wang, Menglan Chen, Shuo Tang, Zhiyu Li, Feiyu Xiong, Siheng Chen

TL;DR

InfoMosaic-Bench introduces the first benchmark for evaluating multi-source information seeking in tool-augmented agents across six domains. It pairs 621 tasks with 77 MCP tools via the InfoMosaic-Flow synthesis pipeline, which grounds problems in verifiable tool outputs and enforces cross-source reasoning. Experiments across 14 LLMs (7 closed- and 7 open-source) show that web search alone is insufficient for domain-specific tasks, and that domain tools yield selective gains while introducing new failure modes in tool usage and orchestration. The work highlights a fundamental gap between current web-focused agents and robust multi-tool information seeking, and it provides a scalable framework and dataset to push progress toward trustworthy decision-making in high-stakes domains.

Abstract

Information seeking is a fundamental requirement for humans. However, existing LLM agents rely heavily on open-web search, which exposes two fundamental weaknesses: online content is noisy and unreliable, and many real-world tasks require precise, domain-specific knowledge unavailable from the web. The emergence of the Model Context Protocol (MCP) now allows agents to interface with thousands of specialized tools, seemingly resolving this limitation. Yet it remains unclear whether agents can effectively leverage such tools -- and more importantly, whether they can integrate them with general-purpose search to solve complex tasks. Therefore, we introduce InfoMosaic-Bench, the first benchmark dedicated to multi-source information seeking in tool-augmented agents. Covering six representative domains (medicine, finance, maps, video, web, and multi-domain integration), InfoMosaic-Bench requires agents to combine general-purpose search with domain-specific tools. Tasks are synthesized with InfoMosaic-Flow, a scalable pipeline that grounds task conditions in verified tool outputs, enforces cross-source dependencies, and filters out shortcut cases solvable by trivial lookup. This design guarantees both reliability and non-triviality. Experiments with 14 state-of-the-art LLM agents reveal three findings: (i) web information alone is insufficient, with GPT-5 achieving only 38.2% accuracy and 67.5% pass rate; (ii) domain tools provide selective but inconsistent benefits, improving some domains while degrading others; and (iii) 22.4% of failures arise from incorrect tool usage or selection, highlighting that current LLMs still struggle with even basic tool handling.
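
The shortcut-filtering idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Condition`, `Task`, and `Solver` names, the source labels, and the rule "reject a task if any single source alone recovers the answer" are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Condition:
    text: str
    source: str  # hypothetical source label, e.g. "web" or "maps"


@dataclass(frozen=True)
class Task:
    question: str
    answer: str
    conditions: Tuple[Condition, ...]


# Hypothetical solver interface: given a task and the set of sources it is
# allowed to query, return its best-guess answer as a string.
Solver = Callable[[Task, FrozenSet[str]], str]


def is_non_trivial(task: Task, solver: Solver) -> bool:
    """Keep a task only if it spans at least two sources and no single
    source is enough for the solver to recover the answer."""
    sources = frozenset(c.source for c in task.conditions)
    if len(sources) < 2:
        return False  # no cross-source dependency at all
    # A task is a shortcut case if any one source alone yields the answer.
    return all(solver(task, frozenset({s})) != task.answer for s in sources)


def filter_tasks(tasks: Iterable[Task], solver: Solver) -> List[Task]:
    """Drop single-source and shortcut tasks, preserving input order."""
    return [t for t in tasks if is_non_trivial(t, solver)]
```

In practice the solver would be an agent restricted to one tool set at a time; here it is left as a plain callable so the filter itself stays easy to inspect.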

Paper Structure

This paper contains 48 sections, 4 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Overview of InfoMosaic-Bench. The benchmark evaluates multi-source information seeking in tool-augmented agents. (Left) Example query illustrating that single-source web search often fails, while multi-source tool use is required. (Center) Dataset statistics, including 621 samples across six domains, 77 MCP tools, and 14 models (7 closed- and 7 open-source). (Right) Radar plot showing domain-wise accuracy across models, and a pie chart illustrating the sample distribution across domains.
  • Figure 2: Overview of InfoMosaic-Flow. The synthesis pipeline is built on an organizer–workers architecture, in which a single organizer acts as the commander and coordinates multiple domain-specific workers. Stage 1 (Information Seeking) composes interdependent constraints and grounds them in verified multi-tool outputs to form initial QA pairs; Stage 2 (Iterative Refinement) revises drafts, prunes shortcuts, and enforces multi-source reasoning.
  • Figure 3: Number of tool calls vs. Acc
  • Figure 4: Number of tool calls vs. Avg PR
  • Figure 5: Number of tool calls vs. Avg Length
  • ...and 6 more figures