AutoData: A Multi-Agent System for Open Web Data Collection

Tianyi Ma, Yiyue Qian, Zheyuan Zhang, Zehong Wang, Xiaoye Qian, Feifan Bai, Yifan Ding, Xuwei Luo, Shinan Zhang, Keerthiram Murugesan, Chuxu Zhang, Yanfang Ye

TL;DR

The paper tackles scalable, high-quality open web data collection by introducing AutoData, a two-squad multi-agent system coordinated by a central manager. It combines eight specialized agents with a novel oriented hypergraph cache (OHCache) that structures inter-agent communication and artifact sharing, reducing token costs and improving reliability. A new benchmark, Instruct2DS, supports live data collection from the open web across three domains, along with symbolic information extraction. Extensive experiments show that AutoData outperforms baselines in accuracy and efficiency, and two case studies demonstrate its practical applicability. The work contributes a scalable, cost-aware, and adaptable framework, plus a benchmark for evaluating open web data tasks, while acknowledging limitations and outlining future enhancements.

Abstract

The exponential growth of data-driven systems and AI technologies has intensified the demand for high-quality web-sourced datasets. While existing datasets have proven valuable, conventional web data collection approaches face significant limitations in terms of human effort and scalability. Current data-collecting solutions fall into two categories: wrapper-based methods that struggle with adaptability and reproducibility, and large language model (LLM)-based approaches that incur substantial computational and financial costs. To address these challenges, we propose AutoData, a novel multi-agent system for Automated web Data collection, that requires minimal human intervention, i.e., only necessitating a natural language instruction specifying the desired dataset. In addition, AutoData is designed with a robust multi-agent architecture, featuring a novel oriented message hypergraph coordinated by a central task manager, to efficiently organize agents across research and development squads. Besides, we introduce a novel hypergraph cache system to advance the multi-agent collaboration process that enables efficient automated data collection and mitigates the token cost issues prevalent in existing LLM-based systems. Moreover, we introduce Instruct2DS, a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports. Comprehensive evaluations over Instruct2DS and three existing benchmark datasets demonstrate AutoData's superior performance compared to baseline methods. Case studies on challenging tasks such as picture book collection and paper extraction from surveys further validate its applicability. Our source code and dataset are available at https://github.com/GraphResearcher/AutoData.

Paper Structure

This paper contains 38 sections, 6 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: The overall framework of AutoData. With the input instruction, agents in the research squad collaborate to generate a development blueprint by browsing the web pages. Afterward, the development squad builds the program based on the blueprint and executes the program to obtain the desired dataset. To ensure efficient and effective multi-agent collaboration, we introduce a novel oriented hypergraph cache system for information sharing. As shown in the figure, the plan agent ($v_1$) sends the message $m_1$ to agents $v_2$, $v_3$, and $v_5$, resulting in an oriented hyperedge $e_1$. Next, the web agent ($v_3$) retrieves messages $m_1$ and $m_3$ from the oriented message hypergraph $\vec{\mathcal{G}}$ for decision making. In addition, we design a hyperedge formatter to formalize the agent messages and a local cache system to store valuable artifacts for subsequent agents to retrieve in an on-demand manner.
  • Figure 2: A sample in Instruct2DS and examples of instruction templates (boxed label: Instruct).
  • Figure 3: Ablation studies for AutoData.
  • Figure 4: Example of an instruction (boxed label: Instruct) and a sample in the corresponding dataset (boxed label: GT-DS).
  • Figure 5: Academic paper data structure and instruction templates.
  • ...and 5 more figures
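
The message flow described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names `Hyperedge` and `MessageHypergraph`, the `send` and `inbox` methods, and the senders and receivers of $m_2$ and $m_3$ are all assumptions made for illustration. Only the hyperedge $e_1$ (from $v_1$ to $v_2$, $v_3$, $v_5$, carrying $m_1$) and the fact that $v_3$ retrieves $m_1$ and $m_3$ come from the caption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Hyperedge:
    """One oriented hyperedge: a single sender, a set of receivers, one message."""
    name: str
    sender: str
    receivers: frozenset
    message: str


@dataclass
class MessageHypergraph:
    """Hypothetical container for oriented hyperedges, kept in send order."""
    edges: list = field(default_factory=list)

    def send(self, sender, receivers, message):
        # Each send creates exactly one hyperedge named e1, e2, ... in order.
        edge = Hyperedge(f"e{len(self.edges) + 1}", sender, frozenset(receivers), message)
        self.edges.append(edge)
        return edge

    def inbox(self, agent):
        # An agent retrieves only the messages whose hyperedge points at it,
        # rather than the full conversation history.
        return [e.message for e in self.edges if agent in e.receivers]


graph = MessageHypergraph()

# From the Figure 1 caption: plan agent v1 sends m1 to v2, v3, and v5 (hyperedge e1).
e1 = graph.send("v1", {"v2", "v3", "v5"}, "m1")

# Assumed for illustration: the caption does not say who sends m2 or m3.
graph.send("v2", {"v4"}, "m2")
graph.send("v2", {"v3"}, "m3")

# Web agent v3 retrieves the messages addressed to it.
web_agent_messages = graph.inbox("v3")
print(web_agent_messages)  # ['m1', 'm3']
```

In this sketch the web agent never sees $m_2$, which hints at how routing messages along oriented hyperedges could keep each agent's context small. The hyperedge formatter and local artifact cache mentioned in the caption are not modeled here.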

Theorems & Definitions (1)

  • Definition 2.1