On the Centralization and Regionalization of the Web

Gautam Akiwate, Kimberly Ruth, Rumaisa Habib, Zakir Durumeric

TL;DR

The paper defines a formal centralization metric based on the Earth Mover's Distance (EMD, also known as the Wasserstein distance), which measures how far an observed provider distribution is from a fully decentralized reference, and applies it to the hosting, DNS, TLD, and CA layers across 150 countries. It also introduces two provider-level measures, usage (a provider's reach) and endemicity (how geographically concentrated that reach is), and uses CrUX data and active measurements to map cross-layer dependencies. The findings show pronounced country-level variation: dominant global players such as Cloudflare and Let's Encrypt shape centralization, regional providers also influence it in many countries, and insularity and regionalization emerge as key factors driving these patterns. By providing a quantitative framework, the work enables cross-country comparisons and highlights the cross-layer interactions and sociopolitical drivers that affect Internet structure and resilience.
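
To make the idea of "distance from a decentralized reference" concrete, here is a minimal Python sketch. It is not the paper's exact formulation: the uniform reference follows the description in the Figure 1 caption, but the rank-ordered support, the unit spacing between ranks, the division by the number of providers, and the function names `emd_1d` and `centralization_sketch` are all assumptions made for illustration.

```python
from itertools import accumulate


def emd_1d(p, q):
    """1-D Earth Mover's Distance between two distributions that share the
    same ordered, unit-spaced support (sum of absolute CDF differences)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same support size")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


def centralization_sketch(site_counts):
    """Illustrative score: EMD between provider shares (ranked from largest
    to smallest) and a uniform reference over the same providers.

    Assumption: dividing by the number of providers is a convenience to keep
    the value below 0.5; the paper may normalize differently.
    """
    total = sum(site_counts)
    n = len(site_counts)
    observed = sorted((count / total for count in site_counts), reverse=True)
    reference = [1.0 / n] * n
    return emd_1d(observed, reference) / n


if __name__ == "__main__":
    # Hypothetical site counts per provider in two made-up countries.
    print(centralization_sketch([40, 30, 20, 10]))
    print(centralization_sketch([97, 1, 1, 1]))
```

Under these assumptions, an evenly split market scores zero and the score grows as sites concentrate in the top-ranked providers, which matches the direction of the comparison in Figure 1 (a larger EMD means more centralization).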

Abstract

Over the past decade, Internet centralization and its implications for both people and the resilience of the Internet have become a topic of active debate. While the networking community informally agrees on the definition of centralization, we lack a formal metric for quantifying centralization, which limits research beyond descriptive analysis. In this work, we introduce a statistical measure for Internet centralization, which we use to better understand how the web is centralized across four layers of web infrastructure (hosting providers, DNS infrastructure, TLDs, and certificate authorities) in 150 countries. Our work uncovers significant geographical variation, as well as a complex interplay between centralization and sociopolitically driven regionalization. We hope that our work can serve as the foundation for more nuanced analysis to inform this important debate.

Paper Structure (48 sections, 5 equations, 22 figures, 8 tables)

Figures (22)

  • Figure 1: Centralization Comparison Example---To calculate the centralization score for the top websites in Countries A and B, we calculate the EMD between each country's observed distribution and a uniform reference distribution. In the example above, the EMDs for Countries A and B are 0.28 and 0.32, respectively, indicating that Country A is less centralized than Country B.
  • Figure 2: Example $\mathscr{S}$ Values---Centralization Score ($\mathscr{S}$) for multiple synthetic distributions. $\mathscr{S}$ values are most sensitive to differences between the highly centralized cases.
  • Figure 3: Usage and Endemicity---Usage ($U$) is the area under the usage curve and endemicity ($\mathcal{E}$) is the area between the usage curve and the horizontal line starting at the usage curve's maximum value. Usage captures popularity, while endemicity captures global consistency in usage. Regional providers have a higher endemicity than global providers.
  • Figure 4: CDF of Hosting Providers---Differences in centralization are largely driven by the distribution of sites amongst the ten largest providers in each country. For both the most centralized (Thailand) and least centralized (Iran) countries, 100 providers account for most sites.
  • Figure 5: Hosting Provider Centralization by Country---Europe is consistently the least centralized continent, while Asia shows high variance across its countries. Other continents do not tend toward either extreme.
  • ...and 17 more figures
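
The area-based definitions in the Figure 3 caption can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: it assumes the usage curve is a list of per-country usage fractions for one provider, treats each country as a unit-width bar so that areas become sums, and applies no normalization. The function name `usage_and_endemicity` and the sample values are hypothetical.

```python
def usage_and_endemicity(per_country_usage):
    """Return (usage, endemicity) for one provider.

    usage      = area under the usage curve (sum of unit-width bars)
    endemicity = area between the curve and a horizontal line at its maximum

    Assumption: one value per country, each a fraction of that country's
    top sites served by the provider.
    """
    if not per_country_usage:
        raise ValueError("need at least one country")
    peak = max(per_country_usage)
    usage = sum(per_country_usage)
    endemicity = sum(peak - u for u in per_country_usage)
    return usage, endemicity


if __name__ == "__main__":
    # Made-up values: a provider used evenly everywhere versus one used
    # heavily in a single country.
    global_provider = [0.3, 0.3, 0.3, 0.3]
    regional_provider = [0.6, 0.0, 0.0, 0.0]
    print(usage_and_endemicity(global_provider))
    print(usage_and_endemicity(regional_provider))
```

With these made-up inputs, the evenly used provider has zero endemicity while the single-country provider has a large one, in line with the caption's statement that regional providers have higher endemicity than global providers.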