Red is Sus: Automated Identification of Low-Quality Service Availability Claims in the US National Broadband Map

Syed Tauhidun Nabi, Zhuowei Wen, Brooke Ritter, Shaddi Hasan

TL;DR

This paper builds a novel dataset of broadband availability consisting of 750k observations from more than 900 US ISPs, derived from a combination of regulatory data and crowdsourced speed tests, and uses it to train a model that classifies the accuracy of service provider regulatory filings, achieving AUCs over 0.98.

Abstract

The FCC's National Broadband Map aspires to provide an unprecedented view into broadband availability in the US. However, this map, which also determines eligibility for public grant funding, relies on self-reported data from service providers, which in turn have incentives to strategically misrepresent their coverage. In this paper, we develop an approach for automatically identifying these low-quality service claims in the National Broadband Map. To do this, we build a novel dataset of broadband availability consisting of 750k observations from more than 900 US ISPs, derived from a combination of regulatory data and crowdsourced speed tests. Using this dataset, we train a model to classify the accuracy of service provider regulatory filings and achieve AUCs over 0.98 for unseen examples. Our approach provides an effective technique to enable policymakers, civil society, and the public to identify portions of the National Broadband Map that are likely to have integrity challenges.

Paper Structure

This paper contains 35 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Count of challenges over time for each major and minor release of the NBM. The first major release, which we focus on in this work, saw nearly two orders of magnitude more challenges than the subsequent release.
  • Figure 2: The state-by-state challenges to the initial Broadband Data Collection (BDC) filings as of June 30, 2022, show significant disparities. A few states contribute nearly 1 million challenges, while many other states report fewer than a thousand challenges.
  • Figure 3: Mean Jaccard Index matrix for provider to ASN mappings by methodology.
  • Figure 4: CDF of locations claimed in the NBM by unmatched providers and all providers. The median and 90th percentile of location claims for all providers are approximately three times higher than those of unmatched providers.
  • Figure 5: The ROC scores for all three cases demonstrate the classifier's ability to distinguish between served and unserved cases, even in stratified holdout state observations (c). Although there is a slight decline in performance for the FCC-adjudicated challenges on the unseen set, the model's accuracy remains above 90% (b).
  • ...and 6 more figures