Newswire: A Large-Scale Structured Database of a Century of Historical News
Emily Silcock, Abhishek Arora, Luca D'Amico-Wong, Melissa Dell
TL;DR
Newswire addresses the lack of a comprehensive historical newswire archive by reconstructing a large-scale, structured dataset from hundreds of terabytes of newspaper scans. The authors develop a multi-stage deep learning pipeline that transcribes front-page articles, de-duplicates reproduced content, georeferences datelines, tags topics, and disambiguates named entities to Wikidata/Wikipedia. The result is 2.7M unique wire articles spanning 1878–1977 with rich metadata. Key contributions include a reproducible method for detecting reproduced content, georeferencing with 94.9% accuracy, named entity recognition with F1 = 90.4, and a scalable entity disambiguation system yielding 1.12M knowledge-base-linked entities. The dataset, hosted on HuggingFace under CC-BY, enables language-model training on historical text and supports research in computational social science and the digital humanities. It also provides a blueprint for reconstructing large-scale archival corpora from dispersed, noisy sources. The authors acknowledge limitations such as OCR noise, incomplete preservation, and potentially offensive content, give guidance on responsible usage, and release an open-source pipeline for replication and extension.
Abstract
Historically, local newspapers in the U.S. drew their content largely from newswires like the Associated Press. Historians argue that newswires played a pivotal role in creating a national identity and shared understanding of the world, but there is no comprehensive archive of the content sent over newswires. We reconstruct such an archive by applying a customized deep learning pipeline to hundreds of terabytes of raw image scans from thousands of local newspapers. The resulting dataset contains 2.7 million unique public domain U.S. newswire articles, written between 1878 and 1977. Locations in these articles are georeferenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model. To construct the Newswire dataset, we first recognize newspaper layouts and transcribe around 138 million structured article texts from raw image scans. We then use a customized neural bi-encoder model to de-duplicate reproduced articles, in the presence of considerable abridgement and noise, quantifying how widely each article was reproduced. A text classifier is used to ensure that we only include newswire articles, which historically are in the public domain. The structured data that accompany the texts provide rich information about the who (disambiguated individuals), what (topics), and where (georeferencing) of the news that millions of Americans read over the course of a century. We also include Library of Congress metadata about the newspapers that ran the articles on their front pages. The Newswire dataset is useful both for large language modeling, where it expands training data beyond what is available from modern web texts, and for studying a diversity of questions in computational linguistics, social science, and the digital humanities.
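As a rough illustration of the de-duplication step described above, the sketch below groups articles whose embedding vectors are close in cosine similarity and counts how many times each article was reproduced. It is a minimal toy under stated assumptions, not the authors' implementation: the embeddings are hand-made stand-ins for bi-encoder outputs, and the function name `cluster_reprints`, the 0.9 threshold, and the single-linkage clustering rule are all hypothetical choices made for illustration.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_reprints(embeddings, threshold=0.9):
    """Group article indices whose embeddings have cosine similarity
    >= threshold, using single linkage via union-find.
    The threshold is an assumed value, not one taken from the paper."""
    parent = list(range(len(embeddings)))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Exhaustive pairwise comparison: fine for a toy example only.
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(embeddings)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hand-made stand-ins for bi-encoder embeddings of four transcribed articles.
toy_embeddings = [
    [1.0, 0.0, 0.0],    # article 0
    [0.98, 0.1, 0.0],   # noisy reprint of article 0
    [0.0, 1.0, 0.0],    # unrelated article
    [0.97, 0.0, 0.12],  # abridged reprint of article 0
]

clusters = cluster_reprints(toy_embeddings)
reprint_counts = [len(c) for c in clusters]  # how widely each article ran
print(clusters, reprint_counts)
```

The pairwise loop is quadratic in the number of articles, so a corpus on the order of the 138 million transcribed texts would need an approximate nearest-neighbor search in place of the exhaustive comparison.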
