A novel multi-threaded web crawling model

Weijie Jiang

TL;DR

The paper addresses efficient large-scale web data acquisition as the volume of online information grows rapidly. It proposes a novel multi-threaded crawling model that partitions the data into sub-tasks processed by parallel threads, with producer-consumer buffering between the crawling, parsing, and writing stages. In experiments across varying data sizes and thread configurations, the approach is substantially faster than a single-threaded baseline: the best settings reported, around $n=10$, $m=10$, $k=10$, cut the time for 500 URLs by about 81%. The work demonstrates scalable throughput gains and offers practical guidance on thread configuration for large-scale web data collection.
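To put the 81% figure in terms of speedup, a short piece of arithmetic helps. The symbols $T_{1}$ (single-threaded time) and $T_{\mathrm{mt}}$ (multi-threaded time) are introduced here for illustration only and are not taken from the paper:

$$
\frac{T_{1} - T_{\mathrm{mt}}}{T_{1}} \approx 0.81
\quad\Longrightarrow\quad
T_{\mathrm{mt}} \approx 0.19\,T_{1}
\quad\Longrightarrow\quad
\frac{T_{1}}{T_{\mathrm{mt}}} \approx \frac{1}{0.19} \approx 5.3
$$

Under that reading, an 81% reduction in time corresponds to roughly a 5.3-fold speedup for the 500-URL case.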

Abstract

This paper proposes a novel web crawling model suited to large-scale web data acquisition. The model first divides the web data into several subsets, each assigned to a thread task. The thread tasks crawl concurrently and store the crawled data in a buffer queue, where it awaits parsing. Parsing is likewise split across several threads. Repeated crawler tests on the model show that it is significantly faster than a single-threaded approach.
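The staged design described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's implementation. It assumes that $n$, $m$, and $k$ denote the number of crawling, parsing, and writing threads, which this page does not state explicitly. The names `fetch`, `parse`, and `crawl_all` are hypothetical, and `fetch` returns fake HTML instead of making a network request so the sketch runs offline.

```python
import queue
import threading

SENTINEL = object()  # unique marker telling a consumer thread to stop


def fetch(url):
    # Hypothetical stand-in for an HTTP request; returns fake HTML.
    return "<title>" + url + "</title>"


def parse(html):
    # Hypothetical parser: strips the fake <title> wrapper.
    return html[len("<title>"):-len("</title>")]


def crawl_all(urls, n=10, m=10, k=10):
    """Assumed meaning: n crawl threads, m parse threads, k write threads."""
    page_q = queue.Queue()    # buffer between crawling and parsing
    record_q = queue.Queue()  # buffer between parsing and writing
    results = []
    lock = threading.Lock()

    def crawler(chunk):
        # Each crawl thread owns one partition of the URL list.
        for url in chunk:
            page_q.put(fetch(url))

    def parser():
        while (html := page_q.get()) is not SENTINEL:
            record_q.put(parse(html))

    def writer():
        while (record := record_q.get()) is not SENTINEL:
            with lock:  # stands in for a file or database write
                results.append(record)

    crawlers = [threading.Thread(target=crawler, args=(urls[i::n],))
                for i in range(n)]
    parsers = [threading.Thread(target=parser) for _ in range(m)]
    writers = [threading.Thread(target=writer) for _ in range(k)]
    for t in crawlers + parsers + writers:
        t.start()

    # Shut the stages down in order, one sentinel per consumer thread.
    for t in crawlers:
        t.join()
    for _ in parsers:
        page_q.put(SENTINEL)
    for t in parsers:
        t.join()
    for _ in writers:
        record_q.put(SENTINEL)
    for t in writers:
        t.join()
    return results
```

A call such as `crawl_all(url_list, n=10, m=5, k=5)` would mirror the configuration named in the Figure 3 caption, under the same assumption about what the three parameters mean. The queues decouple the stages, so a slow fetch does not stall parsing of pages that have already arrived.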

Paper Structure (6 sections, 4 figures, 4 tables)

This paper contains 6 sections, 4 figures, 4 tables.

INTRODUCTION
MOTIVATION
MODEL ARCHITECTURE
EXPERIMENT
SUMMARY
FUTURE WORK

Figures (4)

Figure 1: Model Structure.
Figure 2: Crawler Efficiency under Single-threaded Conditions.
Figure 3: Thread Duration When $n=10$, $m=5$, $k=5$.
Figure 4: Comparison of Single-threaded and Optimal Multi-threading Times.
