Efficient Crawling for Scalable Web Data Acquisition (Extended Version)

Antoine Gauquier, Ioana Manolescu, Pierre Senellart

TL;DR

This work tackles scalable retrieval of open statistics datasets (SDs) from the Web by formulating SD acquisition as a graph crawling problem and proving that solving it optimally is NP-hard. It introduces SB-CLASSIFIER, a reinforcement-learning crawler that uses sleeping bandits and tag-path-based link grouping to learn which hyperlinks lead to data targets, complemented by an online URL classifier that estimates rewards without excessive HEAD requests. The approach shows substantial efficiency gains over multiple baselines across 18 diverse websites, discovering a high fraction of targets while crawling only a fraction of each site. The paper also analyzes hyper-parameters, URL classification quality, and an early-stopping mechanism. The results suggest practical impact for data journalism and public research by enabling explainable, budget-bounded SD retrieval, with extensions to the deep Web and to incremental data updates left as future work.

Abstract

Journalistic fact-checking, as well as social or economic research, requires analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets, i.e., resources of certain types, as possible, from a given website, in an efficient and scalable way, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose an approach based on reinforcement learning, namely using sleeping bandits. Our crawler, SB-CLASSIFIER, efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites with millions of webpages show that our crawler is highly efficient, delivering high fractions of a site's targets while crawling only a small part.
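
To make the sleeping-bandit idea concrete, here is a minimal Python sketch. It is not the SB-CLASSIFIER algorithm from the paper: it assumes a generic UCB-style selection rule, treats each tag path as one arm, and considers an arm "awake" only while the frontier still holds unvisited links with that tag path. The class name `SleepingBandit`, the function `simulate`, the exploration constant, the tag paths, and the reward values are all hypothetical and chosen for illustration.

```python
import math
from collections import defaultdict


class SleepingBandit:
    """UCB-style bandit in which only some arms are available ("awake") per round.

    Illustrative assumption: one arm per tag path; not the paper's exact rule.
    """

    def __init__(self, exploration=2.0):
        self.exploration = exploration          # hypothetical exploration constant
        self.pulls = defaultdict(int)           # times each arm was chosen
        self.mean_reward = defaultdict(float)   # running mean reward per arm
        self.total_pulls = 0

    def select(self, awake_arms):
        """Return the awake arm with the highest upper confidence bound."""
        awake = sorted(awake_arms)              # deterministic tie-breaking
        if not awake:
            raise ValueError("no awake arm to select")
        # Any awake arm that was never tried is explored first.
        for arm in awake:
            if self.pulls[arm] == 0:
                return arm

        def ucb(arm):
            bonus = math.sqrt(
                self.exploration * math.log(self.total_pulls) / self.pulls[arm]
            )
            return self.mean_reward[arm] + bonus

        return max(awake, key=ucb)

    def update(self, arm, reward):
        """Fold an observed reward into the running mean of the chosen arm."""
        self.pulls[arm] += 1
        self.total_pulls += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.pulls[arm]


def simulate(frontier, rewards):
    """Follow links until the frontier is empty; return the arms in visit order.

    frontier: tag path -> number of unvisited links with that tag path
    rewards:  tag path -> made-up number of targets found behind one such link
    """
    bandit = SleepingBandit()
    remaining = dict(frontier)
    order = []
    while any(count > 0 for count in remaining.values()):
        awake = [arm for arm, count in remaining.items() if count > 0]
        arm = bandit.select(awake)
        bandit.update(arm, rewards[arm])
        remaining[arm] -= 1
        order.append(arm)
    return order


# Hypothetical tag paths for a toy site.
MAIN = "html/body/main/ul/li/a"
FOOTER = "html/body/footer/a"
visit_order = simulate({MAIN: 3, FOOTER: 2}, {MAIN: 5.0, FOOTER: 0.0})
```

In this toy run, once both tag paths have been tried, the sketch keeps following links from the rewarding tag path until none remain in the frontier, and only then returns to the unrewarding one. A budget-bounded crawler would simply stop earlier.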

Paper Structure (28 sections, 15 figures, 17 tables, 4 algorithms)

This paper contains 28 sections, 15 figures, 17 tables, 4 algorithms.

Figures (15)

  • Figure 1: Graphical summarization of the graph $G_{\textrm{sc}}$
  • Figure 2: Sample website, crawl, and frontier
  • Figure 3: Tag paths in an HTML page
  • Figure 4: Mapping of a tag path into a fixed-size vector.
  • Figure 5: Comparison of the performance of different crawlers on 10 selected websites presented in Table \ref{tab:websites_characteristics}; for TRES, experiments are only shown for fully-crawled websites. SB-CLASSIFIER is the proposed approach.
  • ...and 10 more figures

Theorems & Definitions (2)

  • Definition 1
  • Definition 2