A Scalable Crawling Algorithm Utilizing Noisy Change-Indicating Signals

Róbert Busa-Fekete, Julian Zimmert, András György, Linhai Qiu, Tzu-Wei Sung, Hao Shen, Hyomin Choi, Sharmila Subramaniam, Li Xiao

TL;DR

This paper extends the classical Poisson-change web crawling model to incorporate noisy change-indicating signals (CISs) and proposes a scalable, decentralized discrete policy derived from a continuous-time solution. By modeling CISs with observable and false-positive components, it derives a threshold-based continuous policy and a bandwidth-controlled discrete implementation that maintains a constant total crawl rate. The approach remains robust to partial observability, signal delays, and changing bandwidth. Experiments show near-parity with the optimal continuous policy and notable improvements when CISs are informative. The work delivers practical, scalable strategies for improving cache freshness while avoiding bandwidth spikes, with implications for large-scale web crawlers and cache maintenance.

Abstract

Web refresh crawling is the problem of keeping a cache of web pages fresh, that is, having the most recent copy available when a page is requested, given a limited bandwidth available to the crawler. Under the assumption that the change events and the request events of each web page follow independent Poisson processes, the optimal scheduling policy was derived by Azar et al. (2018). In this paper, we study an extension of this problem in which side information indicating content changes is available, such as various types of web pings (for example, signals from sitemaps or content delivery networks). Incorporating such side information into the crawling policy is challenging because (i) the signals can be noisy, with false positive events and with missing change events; and (ii) the crawler should achieve fair performance across web pages regardless of the quality of the side information, which might differ from web page to web page. We propose a scalable crawling algorithm which (i) uses the noisy side information in an optimal way under mild assumptions; (ii) can be deployed without heavy centralized computation; and (iii) is able to crawl web pages at a constant total rate without spikes in the total bandwidth usage over any time interval, and to adapt automatically to the new optimal solution when the total bandwidth changes, again without centralized computation. Experiments clearly demonstrate the versatility of our approach.

Paper Structure

This paper contains 28 sections, 3 theorems, 15 equations, 13 figures, 1 algorithm.

Key Result

Lemma 1

Any policy optimizing the objective $O(\pi)$ defined in (eq:opt_objective) has a decision rule such that it triggers a crawl event $(\tau,i)$ when $\tau^{\textsc{eff}}_{i,\tau} \geq \iota_i$ for some fixed threshold vector $\boldsymbol{\iota}=(\iota_1,\ldots,\iota_m)\in(0,\infty]^m$.
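
The decision rule in Lemma 1 can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pages_to_crawl` is hypothetical, and the values $\tau^{\textsc{eff}}_{i,\tau}$ are treated as precomputed inputs because their definition is not given in this section. The sketch only shows the comparison against the threshold vector $\boldsymbol{\iota}$, including the case $\iota_i=\infty$, in which page $i$ is never crawled.

```python
import math
from typing import List, Sequence


def pages_to_crawl(tau_eff: Sequence[float], iota: Sequence[float]) -> List[int]:
    """Return the indices i for which tau_eff[i] >= iota[i].

    tau_eff: current value of the quantity tau^EFF for each page
             (assumed to be computed elsewhere; not defined here).
    iota:    fixed per-page thresholds in (0, inf]; math.inf means
             the page is never crawled.
    """
    if len(tau_eff) != len(iota):
        raise ValueError("tau_eff and iota must have the same length")
    for threshold in iota:
        # Thresholds must lie in (0, inf]; this also rejects NaN.
        if not threshold > 0:
            raise ValueError("each threshold must be strictly positive")
    return [i for i, (t, th) in enumerate(zip(tau_eff, iota)) if t >= th]


# Hypothetical values for three pages; the third has an infinite threshold.
tau_eff = [0.5, 2.0, 10.0]
iota = [1.0, 2.0, math.inf]
print(pages_to_crawl(tau_eff, iota))  # prints [1]
```

Only the second page meets its threshold here: the first is below it, and the third can never reach an infinite threshold regardless of how large its value grows.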

Figures (13)

  • Figure 1: Histogram of precision and recall for URLs with sitemap signals. The histogram is computed by weighting the pages with their importance.
  • Figure 2: Model of the change, request and CI signal processes.
  • Figure 3: Accuracy of discrete policies without using the change-indicating signal. LDS corresponds to Algorithm 3 in Azar et al. (2018), where the input rates come from the solution of (eq:cd_optimization) with the true change and request rates.
  • Figure 4: Accuracy of GREEDY and GREEDY-CIS with change-indicating signal.
  • Figure 5: Empirical crawl rates for various web pages achieved by the discrete policies GREEDY and GREEDY-CIS. Each dot corresponds to a web page, for which we computed the rate of the Baseline method and plotted it against the empirical rate achieved by the corresponding policy. The color of the dots indicates (i) the observability of the change sequence, which is controlled by $\lambda$, in the left panel, and (ii) the change rates of the web pages in the right panel.
  • ...and 8 more figures

Theorems & Definitions (3)

  • Lemma 1
  • Theorem 1
  • Lemma 2