Coverage-Aware Web Crawling for Domain-Specific Supplier Discovery via a Web--Knowledge--Web Pipeline

Yijiashun Qi; Yijiazhen Qi; Tanmay Wagh

TL;DR

We introduce a coverage estimation framework, inspired by ecological species-richness estimators and adapted for web-entity populations, and show that the W$\to$K$\to$W pipeline achieves the highest precision among all compared methods under the same 213-page crawl budget.

Abstract

Identifying the full landscape of small and medium-sized enterprises (SMEs) in specialized industry sectors is critical for supply-chain resilience, yet existing business databases suffer from substantial coverage gaps -- particularly for sub-tier suppliers and firms in emerging niche markets. We propose a \textbf{Web--Knowledge--Web (W$\to$K$\to$W)} pipeline that iteratively (1)~crawls domain-specific web sources to discover candidate supplier entities, (2)~extracts and consolidates structured knowledge into a heterogeneous knowledge graph, and (3)~uses the knowledge graph's topology and coverage signals to guide subsequent crawling toward under-represented regions of the supplier space. To quantify discovery completeness, we introduce a \textbf{coverage estimation framework} inspired by ecological species-richness estimators (Chao1, ACE) adapted for web-entity populations. Experiments on the semiconductor equipment manufacturing sector (NAICS 333242) demonstrate that the W$\to$K$\to$W pipeline achieves the highest precision (0.138) and F1 (0.118) among all methods using the same 213-page crawl budget, building a knowledge graph of 765 entities and 586 relations while reaching peak recall by iteration~3 with only 112 pages.
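To make the coverage idea concrete, the following is a minimal sketch of a bias-corrected Chao1 estimate applied to entity sightings, assuming each crawled page contributes one sighting per supplier it mentions. The function names, the input format, and the supplier labels are hypothetical illustrations; the paper's actual adaptation for web-entity populations may differ.

```python
from collections import Counter


def chao1(sightings):
    """Bias-corrected Chao1 estimate of the total number of distinct entities.

    `sightings` is a flat list of entity identifiers, one entry per page on
    which the entity was observed (an assumed input format).
    """
    counts = Counter(sightings)                      # entity -> number of sightings
    s_obs = len(counts)                              # distinct entities observed
    f1 = sum(1 for c in counts.values() if c == 1)   # entities seen exactly once
    f2 = sum(1 for c in counts.values() if c == 2)   # entities seen exactly twice
    # Bias-corrected form; stays defined even when no entity was seen twice.
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def coverage_ratio(sightings):
    """Observed distinct entities divided by the Chao1 estimate of the total."""
    estimate = chao1(sightings)
    return len(set(sightings)) / estimate if estimate else 0.0


# Hypothetical sightings: two suppliers seen twice, three seen once.
example = ["supplier_a", "supplier_a", "supplier_b", "supplier_c",
           "supplier_d", "supplier_d", "supplier_e"]
estimated_total = chao1(example)
estimated_coverage = coverage_ratio(example)
```

Many singletons relative to doubletons push the estimate well above the observed count, which signals that further crawling is likely to surface unseen suppliers.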

Paper Structure (42 sections, 4 equations, 3 figures, 5 tables)

This paper contains 42 sections, 4 equations, 3 figures, 5 tables.

Introduction
Related Work
Focused and Topical Web Crawling
Knowledge Graph Construction from the Web
Supply-Chain Network Discovery
Species Richness Estimation
Problem Formulation
Methodology: The W$\to$K$\to$W Pipeline
Phase 1: Web $\to$ Knowledge (Entity & Relation Extraction)
Joint NER and Relation Extraction via LLM
Type-Constraint Relation Filtering
Entity Resolution
Phase 2: Knowledge $\to$ Web (Gap-Guided Seed Generation)
Structural Hole Detection
Query Expansion from KG Context
...and 27 more sections

Figures (3)

Figure 1: The W$\to$K$\to$W pipeline. Each iteration crawls web sources (left), extracts entities and relations into the knowledge graph (center), and uses structural gap analysis to generate targeted seeds for the next crawl cycle (right).
Figure 2: Cumulative entity and company discovery over five W$\to$K$\to$W iterations. The non-monotonic discovery rate reflects varying page richness across iterations.
Figure 3: Michaelis-Menten scalability projection. Left: discovery curve with fitted model. Right: projected coverage ratio. The model projects diminishing returns beyond ${\sim}20$ iterations.
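For readers unfamiliar with the curve named in Figure 3, the standard Michaelis-Menten saturation form is sketched below. The symbols are assumptions chosen for illustration ($N(t)$ for cumulative entities discovered after $t$ iterations, $N_{\max}$ for the asymptotic population size, $K$ for the half-saturation constant); the paper's own parametrization is not shown in this excerpt and may differ.

```latex
\[
  N(t) = \frac{N_{\max}\, t}{K + t},
  \qquad
  \frac{N(t)}{N_{\max}} = \frac{t}{K + t}
\]
```

Under this form, half of the asymptotic population is reached at $t = K$, and each additional iteration yields a smaller gain, which is the kind of diminishing-returns behavior the caption describes.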
