Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web--Knowledge--Web Pipeline

Yijiashun Qi, Yijiazhen Qi, Tanmay Wagh

TL;DR

This paper introduces a coverage estimation framework, inspired by ecological species-richness estimators and adapted for web-entity populations, and shows that the W$\to$K$\to$W pipeline achieves the highest precision among all methods using the same 213-page crawl budget.

Abstract

Identifying the full landscape of small and medium-sized enterprises (SMEs) in specialized industry sectors is critical for supply-chain resilience, yet existing business databases suffer from substantial coverage gaps -- particularly for sub-tier suppliers and firms in emerging niche markets. We propose a \textbf{Web--Knowledge--Web (W$\to$K$\to$W)} pipeline that iteratively (1)~crawls domain-specific web sources to discover candidate supplier entities, (2)~extracts and consolidates structured knowledge into a heterogeneous knowledge graph, and (3)~uses the knowledge graph's topology and coverage signals to guide subsequent crawling toward under-represented regions of the supplier space. To quantify discovery completeness, we introduce a \textbf{coverage estimation framework} inspired by ecological species-richness estimators (Chao1, ACE) adapted for web-entity populations. Experiments on the semiconductor equipment manufacturing sector (NAICS 333242) demonstrate that the W$\to$K$\to$W pipeline achieves the highest precision (0.138) and F1 (0.118) among all methods using the same 213-page crawl budget, building a knowledge graph of 765 entities and 586 relations while reaching peak recall by iteration~3 with only 112 pages.
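
To make the coverage idea concrete, the following minimal sketch computes the standard bias-corrected Chao1 richness estimate from entity sighting counts. It assumes that each "species" is a distinct supplier entity and each "individual" is one sighting of that entity on a crawled page; the paper's own adaptation for web-entity populations may differ. The function names and the example company labels are hypothetical.

```python
from collections import Counter


def chao1(sightings):
    """Bias-corrected Chao1 estimate of the total number of distinct entities.

    `sightings` is a list with one item per observation of an entity
    (assumption: one observation = one mention on one crawled page).
    """
    counts = Counter(sightings)                       # entity -> times seen
    s_obs = len(counts)                               # distinct entities observed
    f1 = sum(1 for c in counts.values() if c == 1)    # seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)    # seen exactly twice
    # Standard bias-corrected form; stays defined when f2 == 0.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def coverage_ratio(sightings):
    """Observed distinct entities divided by the Chao1 estimate."""
    estimate = chao1(sightings)
    return len(set(sightings)) / estimate if estimate else 0.0


# Hypothetical sightings: A seen 3 times, B twice, C/D/E once each.
example = ["A", "A", "A", "B", "B", "C", "D", "E"]
estimated_total = chao1(example)        # 5 + 3*2 / (2*2) = 6.5
estimated_coverage = coverage_ratio(example)
```

Many singletons relative to doubletons push the estimate well above the observed count, which is the signal a coverage-aware crawler could use to decide that a region of the supplier space is still under-sampled.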

Paper Structure (42 sections, 4 equations, 3 figures, 5 tables)


Figures (3)

  • Figure 1: The W$\to$K$\to$W pipeline. Each iteration crawls web sources (left), extracts entities and relations into the knowledge graph (center), and uses structural gap analysis to generate targeted seeds for the next crawl cycle (right).
  • Figure 2: Cumulative entity and company discovery over five W$\to$K$\to$W iterations. The non-monotonic discovery rate reflects varying page richness across iterations.
  • Figure 3: Michaelis-Menten scalability projection. Left: discovery curve with fitted model. Right: projected coverage ratio. The model projects diminishing returns beyond ${\sim}20$ iterations.
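
As an illustration of the kind of saturation curve named in the Figure 3 caption, the sketch below fits a Michaelis-Menten model $D(t) = D_{\max}\, t / (K + t)$ to cumulative discovery counts. The source does not state how the authors fit the model; this example swaps in the simple Lineweaver-Burk linearization (regressing $1/D$ on $1/t$) so that it runs with the standard library only. The data are synthetic, generated from assumed parameters, and are not results from the paper.

```python
def michaelis_menten(t, d_max, k):
    """Cumulative discoveries after t iterations under a saturation model."""
    return d_max * t / (k + t)


def fit_lineweaver_burk(ts, ds):
    """Estimate (d_max, k) by least squares on 1/D = 1/d_max + (k/d_max) * (1/t)."""
    xs = [1.0 / t for t in ts]
    ys = [1.0 / d for d in ds]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    d_max = 1.0 / intercept      # intercept is 1/d_max
    k = slope * d_max            # slope is k/d_max
    return d_max, k


# Synthetic, noise-free data from assumed parameters (not from the paper).
TRUE_D_MAX, TRUE_K = 1000.0, 4.0
iterations = [1, 2, 3, 4, 5]
discovered = [michaelis_menten(t, TRUE_D_MAX, TRUE_K) for t in iterations]

fit_d_max, fit_k = fit_lineweaver_burk(iterations, discovered)

# Projected coverage ratio D(t) / d_max at a later iteration.
projected_ratio = michaelis_menten(20, fit_d_max, fit_k) / fit_d_max
```

On noisy data the reciprocal transform overweights early, small counts, so a nonlinear least-squares fit is usually the safer choice; the linearized version is shown here only because it is compact and dependency-free.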