LiveWeb-IE: A Benchmark For Online Web Information Extraction

Seungbin Yang, Jihwan Kim, Jaemin Choi, Dongjin Kim, Soyoung Yang, ChaeHun Park, Jaegul Choo

Abstract

Web information extraction (WIE) is the task of automatically extracting data from web pages, offering high utility for various applications. The evaluation of WIE systems has traditionally relied on benchmarks built from HTML snapshots captured at a single point in time. However, this offline evaluation paradigm fails to account for the temporally evolving nature of the web; consequently, performance on these static benchmarks often fails to generalize to dynamic real-world scenarios. To bridge this gap, we introduce LiveWeb-IE, a new benchmark designed for evaluating WIE systems directly against live websites. Based on trusted and permission-granted websites, we curate natural language queries that require information extraction of various data categories, such as text, images, and hyperlinks. We further design these queries to represent four levels of complexity, based on the number and cardinality of attributes to be extracted, enabling a granular assessment of WIE systems. In addition, we propose Visual Grounding Scraper (VGS), a novel multi-stage agentic framework that mimics human cognitive processes by visually narrowing down web page content to extract desired information. Extensive experiments across diverse backbone models demonstrate the effectiveness and robustness of VGS. We believe that this study lays the foundation for developing practical and robust WIE systems.
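
The abstract's point about static snapshots can be illustrated with a small sketch. It is not taken from the paper: the two page fragments, the XPath strings, and the `extract` helper are all hypothetical, and Python's standard-library `xml.etree.ElementTree` (which supports only a limited XPath subset and requires well-formed markup) stands in for a real HTML parser. It shows an extraction rule that works on a snapshot and silently returns nothing once the live page gains a new element.

```python
import xml.etree.ElementTree as ET

# Hypothetical page as captured in an offline snapshot.
SNAPSHOT = """
<html><body>
  <div class="product">
    <h1>Widget</h1>
    <span class="price">19.99</span>
  </div>
</body></html>
"""

# Hypothetical live version of the same page after a layout change:
# a banner was added above the product block and a badge before the price.
LIVE = """
<html><body>
  <div class="banner">Sale</div>
  <div class="product">
    <h1>Widget</h1>
    <span class="badge">New</span>
    <span class="price">17.49</span>
  </div>
</body></html>
"""

# A position-based rule written against the snapshot.
POSITIONAL_XPATH = "./body/div[1]/span[1]"
# An attribute-based rule that does not depend on element order.
ATTRIBUTE_XPATH = ".//span[@class='price']"


def extract(page: str, xpath: str) -> list[str]:
    """Return the stripped text of every node matched by `xpath`."""
    root = ET.fromstring(page.strip())
    return [(node.text or "").strip() for node in root.findall(xpath)]


if __name__ == "__main__":
    print(extract(SNAPSHOT, POSITIONAL_XPATH))  # matches the price on the snapshot
    print(extract(LIVE, POSITIONAL_XPATH))      # first div is now the banner: no match
    print(extract(LIVE, ATTRIBUTE_XPATH))       # still finds the live price
```

A system scored only on the snapshot would look correct under either rule; the difference only shows up when the rule is run against the changed page, which is the situation a live benchmark is meant to measure.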

Paper Structure (69 sections, 4 equations, 31 figures, 9 tables)

This paper contains 69 sections, 4 equations, 31 figures, 9 tables.

Figures (31)

  • Figure 1: Overview of the conventional WIE paradigm's limitations and our solutions involving a new benchmark and an extraction method. (A) While existing offline benchmarks are constructed from static HTML snapshots, LiveWeb-IE evaluates WIE systems on live websites to reflect the evolving nature of the web. (B) On a complex live web page, methods that process full HTML often fail, whereas VGS leverages visual cues from the rendered page for accurate information extraction.
  • Figure 2: Dataset construction pipeline for LiveWeb-IE. We first select a diverse set of websites and group the web pages within each website by layout. We then annotate the attributes, queries, and values for each page group, followed by a human verification process to ensure data quality.
  • Figure 3: The framework of VGS. It sequentially narrows the observation space in four stages: identifying the target attributes, grounding the relevant region, pinpointing the exact items, and generating the XPaths.
  • Figure 4: The task type and data category distribution.
  • Figure 5: F1 score comparison on existing web information extraction benchmarks. VGS generally outperforms baselines across different backbone models. Full results are shown in Table \ref{tab:existing_benchmark}.
  • ...and 26 more figures
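
Figure 5 reports F1 scores, but the exact metric definition is not part of this excerpt. As a rough sketch only, the following assumes a set-based F1 with exact string matching between extracted and reference values; the function name `f1_score` and the matching rule are assumptions, not the paper's definition.

```python
def f1_score(predicted, gold):
    """Set-based F1 between predicted and reference values (exact match).

    Assumption: duplicates are ignored and values match only if identical.
    """
    pred, ref = set(predicted), set(gold)
    if not pred and not ref:
        # Nothing to extract and nothing extracted: treat as a perfect score.
        return 1.0
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Two of three reference values found, no wrong values extracted.
    print(f1_score(["a", "b"], ["a", "b", "c"]))
```

With two of three reference values recovered and no spurious extractions, precision is 1 and recall is 2/3, giving an F1 of 0.8 under these assumptions.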