Infogent: An Agent-Based Framework for Web Information Aggregation

Revanth Gangi Reddy, Sagnik Mukherjee, Jeonghwan Kim, Zhenhailong Wang, Dilek Hakkani-Tur, Heng Ji

TL;DR

Infogent introduces a modular, feedback-driven framework for web information aggregation that couples a Navigator, Extractor, and Aggregator to perform backtracking, cross-site exploration, and selective information synthesis. It supports both Direct API-Driven and Interactive Visual Access, achieving state-of-the-art results on FRAMES, FanOutQA, and AssistantBench by leveraging specialized components and cross-modal extraction. The work demonstrates that a clear division of labor among navigation, extraction, and aggregation, guided by aggregator feedback, improves information quality and coverage compared to existing baselines. It also highlights challenges such as navigation reliability, dependence on large models, and the need for broader, real-world aggregation benchmarks and evaluation metrics.

Abstract

Despite seemingly performant web agents on task-completion benchmarks, most existing methods evaluate the agents based on a presupposition: the web navigation task consists of a linear sequence of actions with an end state that marks task completion. In contrast, our work focuses on web navigation for information aggregation, wherein the agent must explore different websites to gather information for a complex query. We consider web information aggregation from two different perspectives: (i) Direct API-Driven Access relies on a text-only view of the Web, leveraging external tools such as the Google Search API to navigate the web and a scraper to extract website contents. (ii) Interactive Visual Access uses screenshots of the webpages and requires interaction with the browser to navigate and access information. Motivated by these diverse information access settings, we introduce Infogent, a novel modular framework for web information aggregation involving three distinct components: Navigator, Extractor and Aggregator. Experiments on different information access settings demonstrate that Infogent beats an existing SOTA multi-agent search framework by 7% under Direct API-Driven Access on FRAMES, and improves over an existing information-seeking web agent by 4.3% under Interactive Visual Access on AssistantBench.
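
The division of labor among Navigator, Extractor, and Aggregator can be illustrated with a toy control loop. This is a minimal sketch under loud assumptions, not the authors' implementation: the names `TOY_WEB`, `navigate`, `extract`, `aggregate`, and `run` are hypothetical, the "web" is an in-memory dictionary, and plain string matching stands in for the LLM and VLM calls that the real components would make. It only shows the shape of the loop, in which the Aggregator's textual feedback decides where the Navigator goes next.

```python
from typing import Dict, List

# Hypothetical toy "web": maps a search term to the text of one page.
TOY_WEB: Dict[str, str] = {
    "ipo year": "The company went public in 2020. It sells streaming TV.",
    "cfo": "The CFO joined in 2020. The office is in New York.",
}


def navigate(feedback: str) -> str:
    """Navigator stand-in: use the Aggregator's feedback to pick the next page."""
    return TOY_WEB.get(feedback, "")


def extract(page: str, keyword: str) -> List[str]:
    """Extractor stand-in: keep only sentences that mention the keyword."""
    sentences = [s.strip() for s in page.split(".") if s.strip()]
    return [s for s in sentences if keyword.lower() in s.lower()]


def aggregate(stack: List[str], passages: List[str], pending: List[str]) -> str:
    """Aggregator stand-in: store unseen passages, then emit textual feedback."""
    for passage in passages:
        if passage not in stack:
            stack.append(passage)
    # Assumption: feedback is simply the next open topic, or "done".
    return pending.pop(0) if pending else "done"


def run(topics: List[str], keyword: str, max_steps: int = 10) -> List[str]:
    """Iterate navigate -> extract -> aggregate until the Aggregator says "done"."""
    stack: List[str] = []
    pending = list(topics)
    feedback = pending.pop(0) if pending else "done"
    for _ in range(max_steps):
        if feedback == "done":
            break
        page = navigate(feedback)
        feedback = aggregate(stack, extract(page, keyword), pending)
    return stack


result = run(["ipo year", "cfo"], keyword="2020")
print(result)
```

In this sketch the Aggregator never browses and the Navigator never decides what to keep, which mirrors the separation the abstract describes. A real system would replace the fixed topic queue with model-generated feedback.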

Paper Structure

This paper contains 34 sections, 1 equation, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of Infogent under the Direct API Access and Interactive Visual Access settings: The Navigator uses a tool-based LLM and a browser-controlling VLM as the web agent respectively, with the Aggregator's textual feedback guiding further navigation.
  • Figure 2: A working example of Infogent. $\mathcal{NG}$ iteratively generates an updated query given feedback from $\mathcal{AG}$.
  • Figure 3: An illustrative example of Infogent in the Interactive Visual Access setting for a query from AssistantBench. In steps 1→4, $\mathcal{AG}$ accurately identifies the IPO year (2020) and searches for the management team from that year. In step 5, while $\mathcal{ET}$ correctly identifies Gina DiGioia, it incorrectly extrapolates that John Janedia joined in 2020, even though his past affiliations were only mentioned up to that year. However, $\mathcal{AG}$'s feedback to "look for other members" improves the answer coverage by discovering Mike Berkley, whose name was not listed on Fubo's current web page, in an external news article (in step 7) noting his appointment as Chief Product Officer in 2020.
  • Figure 4: Infogent navigation error examples. The navigator falls into dead loops when it encounters unusual web elements, such as pop-up windows asking to share location (left) or "answer cards" that occasionally appear at the top of Google search results (right).