UKTwitNewsCor: A Dataset of Online Local News Articles for the Study of Local News Provision
Simona Bisiani, Agnes Gulyas, John Wihbey, Bahareh Heravi
TL;DR
UKTwitNewsCor addresses the scarcity of large-scale data for studying UK local journalism. It is a longitudinal dataset of over 2.5 million articles from 360 local outlets (2020–2022), gathered from the links those outlets shared on Twitter and enriched with tweet-level engagement metrics and cross-domain duplication metadata. The authors describe the full data-collection pipeline: constructing a directory of local media domains, identifying Twitter handles, extracting articles, deduplicating via Locality-Sensitive Hashing, and integrating geographic and ownership metadata. They also provide three supplementary datasets (a directory of local media web domains, UK Local Authority Districts, and digital local media providers) that document the corpus's geographic and ownership coverage. The dataset is published in CSV and SQLite formats under CC BY-NC 4.0 with FAIR-aligned documentation, so researchers can analyze production, dissemination, and audience engagement across geography, time, and ownership. Acknowledged limitations include platform dependence and potential coverage biases. Overall, UKTwitNewsCor offers a longitudinal, scalable resource for studying local news ecosystems, informing researchers, policymakers, and industry stakeholders about coverage gaps, content duplication, and the impact of ownership on local journalism.
Abstract
In this paper, we present UKTwitNewsCor, a comprehensive dataset for understanding the content production, dissemination, and audience engagement dynamics of online local media in the UK. It comprises over 2.5 million online news articles published between January 2020 and December 2022 by 360 local outlets. The corpus represents all articles shared on Twitter by the social media accounts of these outlets. We augment the dataset with tweet-level social media performance metrics for the articles and with metadata about content duplication across domains. Alongside the article dataset, we supply three additional datasets: a directory of local media web domains, one of UK Local Authority Districts, and one of digital local media providers, which together provide statistics on the coverage scope of UKTwitNewsCor. Our contributions enable comprehensive, longitudinal analysis of UK local media, news trends, and content diversity across multiple platforms and geographic areas. We describe the data collection methodology, assess the dataset's geographic and media ownership diversity, and outline how researchers, policymakers, and industry stakeholders can leverage UKTwitNewsCor to advance the study of local media.
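To make the idea of cross-domain duplication metadata concrete, the following is a minimal sketch of near-duplicate detection with MinHash signatures and Locality-Sensitive Hashing (LSH) banding. It is an illustration only, not the authors' implementation: the shingle size, number of hash functions, number of bands, and all function names are assumptions chosen for readability, and the article identifiers in the usage note are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

# All three parameters are illustrative assumptions, not values from the paper.
NUM_PERM = 64   # hash functions per MinHash signature
BANDS = 16      # LSH bands; each band covers NUM_PERM // BANDS signature rows
SHINGLE = 3     # words per shingle


def shingles(text, k=SHINGLE):
    """Return the set of k-word shingles of a text (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted by the seed."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def candidate_pairs(docs):
    """Return pairs of document ids whose signatures agree on at least one band.

    docs maps a document id (for example a URL) to its article text.
    """
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            band = sig[b * rows:(b + 1) * rows]
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity of two texts' shingle sets, to verify candidates."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)
```

In use, one would pass a mapping such as `{"a.example/1": text_a, "b.example/9": text_b}` to `candidate_pairs` and then confirm each candidate with `jaccard` against a chosen threshold. Banding trades recall for speed: more bands with fewer rows each catch looser matches at the cost of more candidates to verify.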
