Efficient Crawling for Scalable Web Data Acquisition (Extended Version)

Antoine Gauquier; Ioana Manolescu; Pierre Senellart

TL;DR

This work tackles scalable retrieval of open statistics datasets (SDs) from the Web by formulating SD acquisition as a graph crawling problem and showing that solving it optimally is NP-hard. It introduces SB-CLASSIFIER, a reinforcement-learning crawler that uses sleeping bandits and tag-path-based link grouping to learn which hyperlinks lead to data targets, complemented by an online URL classifier that estimates rewards without issuing excessive HEAD requests. Across 18 diverse websites, the approach is substantially more efficient than multiple baselines, retrieving a high fraction of each site's targets while crawling only a fraction of the site. The paper also analyzes hyper-parameters, URL classification quality, and an early-stopping mechanism. The results suggest practical impact for data journalism and public research by enabling explainable, budget-bounded SD retrieval, with extensions to the deep Web and to incremental data updates left as future work.

Abstract

Journalistic fact-checking, as well as social or economic research, requires analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets, i.e., resources of certain types, as possible, from a given website, in an efficient and scalable way, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose an approach based on reinforcement learning, namely using sleeping bandits. We propose SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites with millions of webpages show that our crawler is highly efficient, delivering high fractions of a site's targets while crawling only a small part.
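
To make the sleeping-bandit idea concrete, here is a minimal sketch of a UCB-style bandit in which only some arms are available ("awake") at each step. It is an illustration under stated assumptions, not the SB-CLASSIFIER algorithm itself: the class name `SleepingBandit`, the `exploration` constant, and the use of plain strings as arm identifiers are all hypothetical. In a crawler one might let each arm stand for a group of links with similar tag paths, treat an arm as awake while it still has uncrawled links, and use the number of targets found as the reward; those mappings are assumptions for this example.

```python
import math


class SleepingBandit:
    """UCB-style bandit where only a subset of arms is awake at each step.

    Hypothetical illustration; not the algorithm described in the paper.
    """

    def __init__(self, exploration=2.0):
        self.exploration = exploration  # assumed exploration constant
        self.counts = {}  # arm -> number of times the arm was played
        self.means = {}   # arm -> running mean of observed rewards
        self.total = 0    # total number of plays over all arms

    def select(self, awake_arms):
        """Return the awake arm with the highest upper confidence bound."""
        if not awake_arms:
            raise ValueError("at least one arm must be awake")
        best_arm, best_score = None, -math.inf
        for arm in awake_arms:
            n = self.counts.get(arm, 0)
            if n == 0:
                return arm  # play every newly seen arm once
            bonus = math.sqrt(self.exploration * math.log(self.total) / n)
            score = self.means[arm] + bonus
            if score > best_score:
                best_arm, best_score = arm, score
        return best_arm

    def update(self, arm, reward):
        """Record a reward for an arm and refresh its running mean."""
        n = self.counts.get(arm, 0) + 1
        mean = self.means.get(arm, 0.0)
        self.counts[arm] = n
        self.means[arm] = mean + (reward - mean) / n
        self.total += 1
```

A crawl loop built on this sketch would call `select` with the arms that currently have links on the frontier, fetch one link from the chosen arm, and pass the observed reward to `update`. Arms that are asleep are simply never compared, which is what distinguishes this setting from a standard multi-armed bandit.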

Paper Structure (28 sections, 15 figures, 17 tables, 4 algorithms)

Introduction
Problem Statement and Modeling
Graph Crawling Problem
Data Acquisition as Graph Crawling
Crawling based on Reinforcement Learning
Environment: States and Actions
Grouping Links into Actions
Estimating Rewards with URL Classifier
Crawling Algorithm
Experimental Results
Websites
Search Engines and Dataset Search
...and 13 more sections

Figures (15)

Figure 1: Graphical summarization of the graph $G_{\textrm{sc}}$
Figure 2: Sample website, crawl, and frontier
Figure 3: Tag paths in an HTML page
Figure 4: Mapping of a tag path into a fixed-size vector.
Figure 5: Comparison of different crawler performance for 10 selected websites presented in Table \ref{tab:websites_characteristics}; for TRES, experiments are only shown for fully-crawled websites. SB-CLASSIFIER is the proposed approach.
...and 10 more figures

Theorems & Definitions (2)

Definition 1
Definition 2
