AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation

Wenhao Huang; Zhouhong Gu; Chenghao Peng; Zhixu Li; Jiaqing Liang; Yanghua Xiao; Liqian Wen; Zulong Chen

TL;DR

This work introduces the paradigm of generating web scrapers with LLMs and proposes AutoScraper, a two-stage framework that handles diverse and changing web environments more efficiently. It also proposes a new executability metric for better measuring performance on web scraper generation tasks.

Abstract

Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabilities, and minimizing manual data entry efforts. Existing methods fall short in two ways: wrapper-based methods suffer from limited adaptability and scalability when faced with a new website, while language agents, empowered by large language models (LLMs), exhibit poor reusability in diverse web environments. In this work, we introduce the paradigm of generating web scrapers with LLMs and propose AutoScraper, a two-stage framework that can handle diverse and changing web environments more efficiently. AutoScraper leverages the hierarchical structure of HTML and the similarity across different web pages to generate web scrapers. In addition, we propose a new executability metric for better measuring performance on web scraper generation tasks. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness of our framework. Resources for this paper can be found at https://github.com/EZ-hwh/AutoScraper.

Paper Structure (41 sections, 6 equations, 4 figures, 18 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Task Formulation
Datasets
SWDE
Extended SWDE
DS1
Evaluation Metrics
AutoScraper
Modeling
Progressive Generation
Synthesis
Experiment
Experimental Settings & Evaluation Metrics
...and 26 more sections

Figures (4)

Figure 1: A comparison of wrapper-based methods, language-agent-based methods, and AutoScraper.
Figure 2: The AutoScraper framework, which has two phases: (a) progressive generation and (b) synthesis.
Figure 3: The performance of AutoScraper with different numbers of seed websites on the SWDE dataset.
Figure 4: Comparison of AutoScraper with CoT and Reflexion.
