Efficient Crawling for Scalable Web Data Acquisition (Extended Version)
Antoine Gauquier, Ioana Manolescu, Pierre Senellart
TL;DR
This work tackles scalable retrieval of open statistics datasets (SDs) from the web by formulating SD acquisition as a graph crawling problem and proving that the problem is NP-hard. It introduces SB-CLASSIFIER, a reinforcement-learning crawler that uses sleeping bandits and tag-path-based link grouping to learn which hyperlinks lead to data targets, complemented by an online URL classifier that estimates rewards without issuing excessive HEAD requests. Across 18 diverse websites, the approach shows substantial efficiency gains over multiple baselines, discovering a high share of targets while crawling only a fraction of each site. The paper also analyzes hyper-parameters, URL classification quality, and an early-stopping mechanism. The results suggest practical impact for data journalism and public research by enabling explainable, budget-bounded SD retrieval, with clear avenues for extending to the deep Web and to incremental data updates in future work.
Abstract
Journalistic fact-checking, as well as social or economic research, requires analyzing high-quality statistics datasets (SDs, in short). However, retrieving SD corpora at scale may be hard, inefficient, or impossible, depending on how they are published online. To improve open statistics data accessibility, we present a focused Web crawling algorithm that retrieves as many targets, i.e., resources of certain types, as possible, from a given website, in an efficient and scalable way, by crawling (much) less than the full website. We show that optimally solving this problem is intractable, and propose an approach based on reinforcement learning, namely using sleeping bandits. We propose SB-CLASSIFIER, a crawler that efficiently learns which hyperlinks lead to pages that link to many targets, based on the paths leading to the links in their enclosing webpages. Our experiments on websites with millions of webpages show that our crawler is highly efficient, delivering high fractions of a site's targets while crawling only a small part of it.
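To make the sleeping-bandit idea concrete, here is a minimal sketch of a crawl frontier in which each arm is a group of links sharing the same tag path, and an arm is "asleep" whenever it has no uncrawled links left. This is an illustration under stated assumptions, not the SB-CLASSIFIER implementation: the class name `SleepingBanditFrontier`, the UCB-style score, the `exploration` parameter, and the reward values are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict


class SleepingBanditFrontier:
    """Toy crawl frontier: one arm per tag path (hypothetical design).

    An arm is awake only while it still holds uncrawled links.
    """

    def __init__(self, exploration=1.0):
        self.exploration = exploration          # assumed exploration weight
        self.pending = defaultdict(list)        # tag path -> uncrawled URLs
        self.pulls = defaultdict(int)           # tag path -> times selected
        self.mean_reward = defaultdict(float)   # tag path -> running mean reward
        self.total_pulls = 0

    def add_link(self, tag_path, url):
        """Register a newly discovered link under its tag path."""
        self.pending[tag_path].append(url)

    def awake_arms(self):
        """Arms that still have at least one uncrawled link."""
        return [path for path, urls in self.pending.items() if urls]

    def _score(self, path):
        # Arms never tried yet are explored first.
        if self.pulls[path] == 0:
            return math.inf
        bonus = self.exploration * math.sqrt(
            2 * math.log(self.total_pulls) / self.pulls[path]
        )
        return self.mean_reward[path] + bonus

    def select(self):
        """Pick the best awake arm and pop one of its links, or None if all sleep."""
        awake = self.awake_arms()
        if not awake:
            return None
        best = max(awake, key=self._score)
        return best, self.pending[best].pop(0)

    def update(self, tag_path, reward):
        """Fold an observed reward (e.g. number of targets found) into the arm's mean."""
        self.pulls[tag_path] += 1
        self.total_pulls += 1
        n = self.pulls[tag_path]
        self.mean_reward[tag_path] += (reward - self.mean_reward[tag_path]) / n
```

A crawler built on this sketch would call `select()` to get the next URL, fetch the page, count the targets it links to, and pass that count to `update()`. Restricting the choice to awake arms is what distinguishes the sleeping-bandit setting from a standard bandit, where every arm is available in every round.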
