Webscraper: Leverage Multimodal Large Language Models for Index-Content Web Scraping

Guan-Lun Huang, Yuh-Jzer Joung

Abstract

Modern web scraping struggles with dynamic, interactive websites that require more than static HTML parsing. Current methods are often brittle and require manual customization for each site. To address this, we introduce Webscraper, a framework designed to handle the challenges of modern, dynamic web applications. It leverages a Multimodal Large Language Model (MLLM) to autonomously navigate interactive interfaces, invoke specialized tools, and perform structured data extraction in environments where traditional scrapers are ineffective. Webscraper uses a structured five-stage prompting procedure and a set of custom-built tools to navigate and extract data from websites that follow the common "index-and-content" architecture. Our experiments, conducted on six news websites, demonstrate that the full Webscraper framework, equipped with both our guiding prompt and specialized tools, achieves a significant improvement in extraction accuracy over the baseline agent, Anthropic's Computer Use. We also applied the framework to e-commerce platforms to validate its generalizability.

Paper Structure

This paper contains 15 sections, 3 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An overview of our proposed system architecture. The agent (Computer Use) receives prompts and uses a combination of native environment tools (top right) and our custom-built scraping tools (bottom right) to perform the extraction task and produce structured data.
  • Figure 2: Convergence of the 95% CI half-width for a representative scenario. An "elbow point" is visible around n=30, after which the curve flattens, indicating diminishing returns for additional runs.
  • Figure 3: Performance comparison of our proposed methods against the baseline across six different news websites.
  • Figure 4: Extraction ambiguity on a Momo product page. The agent must identify the market price among multiple competing fields (e.g., promotional price and discounted price, highlighted in the red box), which increases task complexity.
  • Figure 5: Cumulative average correctness plotted at 5-run intervals for a representative experimental setting. The plot shows that the performance metric tends to stabilize after approximately 30 runs, with subsequent fluctuations remaining minimal. This visual evidence supports our choice of n=30 as a stable and efficient number of runs.
  • ...and 1 more figures
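
The quantities named in the captions of Figures 2 and 5 (the 95% CI half-width as a function of the number of runs n, and the cumulative average correctness sampled every 5 runs) can be illustrated with a minimal sketch. The assumptions here are not taken from the paper: each run is taken to yield a single numeric correctness score, the half-width uses the normal approximation z * s / sqrt(n) with z = 1.96, and the function names are hypothetical. The paper may compute these differently (for example with a t-distribution).

```python
import math
import statistics


def ci_half_width(scores, z=1.96):
    """Approximate 95% CI half-width of the mean: z * s / sqrt(n).

    Assumption: normal approximation with z = 1.96; `scores` holds one
    correctness value per run.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate a spread")
    return z * statistics.stdev(scores) / math.sqrt(n)


def cumulative_average(scores, step=5):
    """Running mean reported every `step` runs, as (n, mean) pairs."""
    return [
        (n, statistics.fmean(scores[:n]))
        for n in range(step, len(scores) + 1, step)
    ]


if __name__ == "__main__":
    # Hypothetical per-run correctness scores (1 = correct, 0 = incorrect).
    runs = [1, 0] * 20
    for n, mean in cumulative_average(runs):
        print(n, round(mean, 3), round(ci_half_width(runs[:n]), 3))
```

Because the half-width shrinks roughly as 1/sqrt(n), each additional batch of runs narrows the interval less than the previous one, which is the diminishing-returns behaviour the captions describe.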

Theorems & Definitions (4)

  • remark 1
  • remark 2
  • remark 3
  • remark 4