A novel multi-threaded web crawling model
Weijie Jiang
TL;DR
The paper addresses efficient large-scale web data acquisition amid the rapid growth of online information. It proposes a novel multi-threaded crawling model that partitions the data into sub-tasks processed by parallel threads, with producer-consumer buffering between the crawling, parsing, and writing stages. In experiments across varying data sizes and thread configurations, the approach is substantially faster than single-threaded baselines; the best settings, around $n=10$, $m=10$, $k=10$, cut the running time by about 81% for 500 URLs (roughly a 5x speedup). The work demonstrates scalable throughput gains and offers practical guidance on thread configuration for large-scale web data collection.
Abstract
This paper proposes a novel web crawling model suited to large-scale web data acquisition. The model first divides the web data to be crawled into several subsets, each assigned to its own thread task. Within each thread task, crawling runs concurrently, and the crawled data are placed in a buffer queue to await parsing. Parsing is likewise split across several threads. Repeated crawler tests on the resulting model show that it is significantly faster than single-threaded approaches.
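
The staged design described in the abstract can be illustrated with a minimal Python sketch, given here under stated assumptions rather than as the paper's implementation. The names `fake_fetch`, `parse`, `run_pipeline`, and the parameters `n_crawlers`, `n_parsers`, `n_writers` are hypothetical; they are not taken from the paper and need not correspond to its $n$, $m$, $k$. The fetch step is simulated so the example runs without network access.

```python
import queue
import threading

_STOP = object()  # sentinel telling a consumer thread to exit


def fake_fetch(url):
    # Hypothetical stand-in for an HTTP request (assumption: no real network I/O).
    return f"<html><title>{url}</title></html>"


def parse(html):
    # Hypothetical parser: return the text between the <title> tags.
    start = html.index("<title>") + len("<title>")
    return html[start:html.index("</title>")]


def crawl_worker(urls, page_queue, fetch):
    # Producer: fetch every URL of one sub-task and buffer the raw page.
    for url in urls:
        page_queue.put(fetch(url))


def parse_worker(page_queue, record_queue):
    # Consumes buffered pages and produces parsed records.
    while True:
        page = page_queue.get()
        if page is _STOP:
            break
        record_queue.put(parse(page))


def write_worker(record_queue, sink, lock):
    # Final consumer: appends parsed records to a shared in-memory sink.
    while True:
        record = record_queue.get()
        if record is _STOP:
            break
        with lock:
            sink.append(record)


def run_pipeline(urls, n_crawlers=4, n_parsers=2, n_writers=2, fetch=fake_fetch):
    page_queue, record_queue = queue.Queue(), queue.Queue()
    sink, lock = [], threading.Lock()

    # Partition the URL list into one sub-task per crawler thread (round-robin).
    chunks = [urls[i::n_crawlers] for i in range(n_crawlers)]
    crawlers = [threading.Thread(target=crawl_worker, args=(chunk, page_queue, fetch))
                for chunk in chunks]
    parsers = [threading.Thread(target=parse_worker, args=(page_queue, record_queue))
               for _ in range(n_parsers)]
    writers = [threading.Thread(target=write_worker, args=(record_queue, sink, lock))
               for _ in range(n_writers)]

    for t in crawlers + parsers + writers:
        t.start()

    # Shut the stages down in order: one sentinel per consumer thread.
    for t in crawlers:
        t.join()
    for _ in parsers:
        page_queue.put(_STOP)
    for t in parsers:
        t.join()
    for _ in writers:
        record_queue.put(_STOP)
    for t in writers:
        t.join()
    return sink
```

Calling `run_pipeline(url_list)` returns the parsed records in an order that depends on thread scheduling. Two unbounded `queue.Queue` buffers decouple the stages in this sketch; passing a `maxsize` to them would add back-pressure if crawling outpaces parsing.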
