SCRIBES: Web-Scale Script-Based Semi-Structured Data Extraction with Reinforcement Learning

Shicheng Liu, Kai Sun, Lisheng Fu, Xilun Chen, Xinyuan Zhang, Zhaojiang Lin, Rulin Shao, Yue Liu, Anuj Kumar, Wen-tau Yih, Xin Luna Dong

TL;DR

SCRIBES introduces a reinforcement learning (RL) framework that generates reusable extraction scripts for groups of structurally similar web pages, enabling web-scale knowledge extraction from semi-structured HTML content. By using layout similarity as a reward signal and training on both labeled data and unlabeled CommonCrawl data, SCRIBES learns scripts that generalize across pages within a site, reducing per-page LLM cost while maintaining high extraction quality. Empirical results show gains of over 13% in script quality and over 4% in downstream QA accuracy for GPT-4o, along with substantial token-level speedups as page group size grows. The approach offers practical benefits for large-scale data curation and pretraining, enabling more efficient incorporation of semi-structured data into downstream tasks and models.

Abstract

Semi-structured content in HTML tables, lists, and infoboxes accounts for a substantial share of factual data on the web, yet its formatting complicates usage, and reliably extracting structured information from it remains challenging. Existing methods either lack generalization or are resource-intensive due to per-page LLM inference. In this paper, we introduce SCRIBES (SCRIpt-Based Semi-Structured Content Extraction at Web-Scale), a novel reinforcement learning framework that leverages layout similarity across webpages within the same site as a reward signal. Instead of processing each page individually, SCRIBES generates reusable extraction scripts that can be applied to groups of structurally similar webpages. Our approach improves further through iterative training on synthetic annotations from in-the-wild CommonCrawl data. Experiments show that our approach outperforms strong baselines by over 13% in script quality and boosts downstream question answering accuracy by more than 4% for GPT-4o, enabling scalable and resource-efficient web information extraction.
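
The core idea of the abstract, one script written from a representative page and then scored on every page in the group, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `extraction_script` is a hand-written stand-in for a model-generated script, `f1` is a simple set-overlap score rather than the paper's script-quality metric, and the two toy pages are invented.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def extraction_script(html):
    """Hypothetical stand-in for a generated script: two-cell rows become (key, value) pairs."""
    parser = CellCollector()
    parser.feed(html)
    return {(row[0], row[1]) for row in parser.rows if len(row) == 2}


def f1(predicted, gold):
    """Set-overlap F1 (an assumed metric, not necessarily the paper's)."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def group_reward(script, pages, annotations):
    """Average the script's score over all pages in the group, not only the representative."""
    scores = [f1(script(page), gold) for page, gold in zip(pages, annotations)]
    return sum(scores) / len(scores)


# Two invented pages that share one layout, as pages within the same site often do.
pages = [
    "<table><tr><th>Name</th><td>Ada</td></tr><tr><th>Born</th><td>1815</td></tr></table>",
    "<table><tr><th>Name</th><td>Alan</td></tr><tr><th>Born</th><td>1912</td></tr></table>",
]
annotations = [
    {("Name", "Ada"), ("Born", "1815")},
    {("Name", "Alan"), ("Born", "1912")},
]
reward = group_reward(extraction_script, pages, annotations)
```

Because the script is produced once per group and then run as ordinary code on each page, the LLM cost in this sketch does not grow with the number of pages, which is the intuition behind the reported savings for larger page groups.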

Paper Structure

This paper contains 35 sections, 9 equations, 5 figures, 15 tables, 1 algorithm.

Figures (5)

  • Figure 1: SCRIBES organizes similar webpages into groups under each website. During training, the model receives one representative webpage per group as input (pt. 1) and is tasked with generating a single extraction script applicable to all similar webpages within the group (pt. 2). Extraction results are then compared against human annotations for labeled data and synthetic annotations for unlabeled CommonCrawl webpages. The resulting scores are used to update the model weights (pt. 3). At inference time, SCRIBES enables the model to generalize to new, unseen websites by generating scripts that can be applied across similar webpages (pt. 4).
  • Figure 2: Three webpages containing semi-structured content under the same website.
  • Figure 3: Processing pipeline for unlabeled data from CommonCrawl in Section \ref{sec:reward_signal_unlabeled}.
  • Figure 4: Performance of our best Q-32B model by amount of structure and page type, showing that websites with more numerous or complex structures are more challenging.
  • Figure 5: An example illustrating Algorithm \ref{alg:dedup}. The original HTML appears on the left, while the compressed HTML is shown on the right. The dashed-highlighted section near the top, containing script and style elements, has been removed. The repeated HTML content near the bottom has been deduplicated, retaining up to $z=3$ elements.
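
The compression described in the Figure 5 caption (script and style elements removed, repeated content capped at $z$ elements) might look roughly like the sketch below. It is not the paper's algorithm: the definition of "repeated" used here (consecutive siblings with the same tag and `class` attribute) is an assumption, and `xml.etree.ElementTree` only accepts well-formed markup, so real HTML would need a more tolerant parser.

```python
import xml.etree.ElementTree as ET


def signature(element):
    """Assumed notion of 'repeated': same tag and same class attribute."""
    return (element.tag, element.get("class"))


def compress(element, z=3):
    """Drop <script>/<style> children and keep at most z consecutive look-alike siblings."""
    previous, run_length = None, 0
    for child in list(element):  # iterate over a copy, since children are removed in the loop
        if child.tag in ("script", "style"):
            element.remove(child)
            continue
        sig = signature(child)
        run_length = run_length + 1 if sig == previous else 1
        previous = sig
        if run_length > z:
            element.remove(child)   # beyond the cap: discard the repeated element
        else:
            compress(child, z)      # otherwise compress its subtree as well
    return element


# Invented example: a page with tracking code, styling, and six near-identical list items.
html = (
    "<body><script>track()</script><style>li{color:red}</style>"
    "<ul>" + "".join(f"<li class='row'>Item {i}</li>" for i in range(6)) + "</ul></body>"
)
root = compress(ET.fromstring(html), z=3)
compressed = ET.tostring(root, encoding="unicode")
```

Keeping a few repeated elements rather than one preserves enough of the pattern for a script to be written against it while shortening the input considerably.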