Web2Wiki: Characterizing Wikipedia Linking Across the Web

Veniamin Veselovsky, Tiziano Piccardi, Ashton Anderson, Robert West, Akhil Arora

TL;DR

This paper addresses how Wikipedia is referenced across the entire Web, not just within its own site, by introducing Web2Wiki and conducting a large-scale, Common Crawl–driven analysis. It outlines three research questions (what articles are linked, where links appear, and why they are linked) and combines domain-level, page-level, and content-segmentation analyses with topic modeling and a novel two-category linking taxonomy. Key contributions include the Web2Wiki dataset, evidence that Wikipedia links are predominantly delegation-based rather than evidence-based, and insights into topic- and platform-specific linking patterns that position Wikipedia as a global knowledge infrastructure. The work has practical significance for understanding Wikipedia's influence on public discourse, informs potential policy and design decisions, and provides a foundation for multilingual and cross-domain exploration of web-scale linking structures.

Abstract

Wikipedia is one of the most visited websites globally, yet its role beyond its own platform remains largely unexplored. In this paper, we present the first large-scale analysis of how Wikipedia is referenced across the Web. Using a dataset from Common Crawl, we identify over 90 million Wikipedia links spanning 1.68% of Web domains and examine their distribution, context, and function. Our analysis of English Wikipedia reveals three key findings: (1) Wikipedia is most frequently cited by news and science websites for informational purposes, while commercial websites reference it less often. (2) The majority of Wikipedia links appear within the main content rather than in boilerplate or user-generated sections, highlighting their role in structured knowledge presentation. (3) Most links (95%) serve as explanatory references rather than as evidence or attribution, reinforcing Wikipedia's function as a background knowledge provider. While this study focuses on English Wikipedia, our publicly released Web2Wiki dataset includes links from multiple language editions, supporting future research on Wikipedia's global influence on the Web.
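To make the link-identification step concrete, the following is a minimal sketch of how a URL found on a crawled page might be classified as a Wikipedia article link and assigned to a language edition. It is not the authors' pipeline: the function name `parse_wikipedia_link`, the choice to accept only `/wiki/` paths, and the handling of mobile hosts are all assumptions made for illustration.

```python
from urllib.parse import urlparse, unquote


def parse_wikipedia_link(url):
    """Return (language, title) if url looks like a Wikipedia article link, else None.

    Hypothetical helper: the matching rules below are illustrative assumptions,
    not the rules used to build the Web2Wiki dataset.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    parts = host.split(".")

    # Accept <lang>.wikipedia.org and the mobile form <lang>.m.wikipedia.org.
    if len(parts) == 3 and parts[1:] == ["wikipedia", "org"]:
        lang = parts[0]
    elif len(parts) == 4 and parts[1:] == ["m", "wikipedia", "org"]:
        lang = parts[0]
    else:
        return None

    # www.wikipedia.org is the multilingual portal, not a language edition.
    if lang == "www":
        return None

    # Only /wiki/<title> paths are treated as article links in this sketch.
    # Namespaced pages such as "Talk:..." are not filtered out here.
    prefix = "/wiki/"
    if not parsed.path.startswith(prefix):
        return None

    title = unquote(parsed.path[len(prefix):]).replace("_", " ")
    if not title:
        return None
    return lang, title


if __name__ == "__main__":
    for candidate in [
        "https://en.wikipedia.org/wiki/Common_Crawl",
        "https://de.m.wikipedia.org/wiki/Z%C3%BCrich",
        "https://example.com/wiki/Common_Crawl",
    ]:
        print(candidate, "->", parse_wikipedia_link(candidate))
```

Restricting the result to the `en` edition would correspond to the English-only focus of the analysis, while the other editions remain available for multilingual follow-up work.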

Paper Structure

This paper contains 21 sections, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Comparison of the proportions of Wikipedia-internal links, Web links, and Reddit shares of Wikipedia. A higher score means proportionally more in-links from external websites than from within Wikipedia. Error bars represent 95% bootstrapped confidence intervals.
  • Figure 2: Comparison between websites that link to Wikipedia and a random sample of all websites. For each topic, the difference between the "Web overall" and "Web linking to Wikipedia" probabilities determines its leaning. When the score for "Web linking to Wikipedia" is higher, it implies that the topic tends to link more to Wikipedia, while the reverse implies that Wikipedia articles are underrepresented across that particular topic. Error bars represent 95% bootstrapped confidence intervals.
  • Figure 3: Comparison of where Wikipedia links are likely to occur in a webpage and their Curlie topic. Recall that Homepage2Vec outputs a separate prediction for each Curlie topic, so the probabilities across topics do not necessarily add to 1. Error bars represent 95% bootstrapped confidence intervals.
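All three captions report 95% bootstrapped confidence intervals. The captions do not say which bootstrap variant was used, so the following is only a minimal sketch of one common choice, the percentile bootstrap for a mean. The function name `bootstrap_ci`, the number of resamples, and the fixed seed are assumptions made for illustration.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Illustrative sketch only; the paper's exact procedure is not specified
    in the captions above.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample n observations with replacement and record the mean.
        sample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the resampled means.
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Toy data: 1 means a page was assigned a given topic, 0 means it was not.
    labels = [0] * 50 + [1] * 50
    low, high = bootstrap_ci(labels)
    print(f"mean = {sum(labels) / len(labels):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

When two groups are compared, as in Figure 2, non-overlapping intervals are a conservative visual cue that the groups differ; bootstrapping the difference between the groups directly would be a more precise check.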