Newswire Extraction: A pipeline for extracting newswires from newspaper images
Michael McRae
TL;DR
This paper presents a multi-stage pipeline for extracting and attributing wire-service content in scanned newspapers. The pipeline combines layout analysis with a YOLOv10 detector, OCR, SBERT-based wire-service classifiers, noise-robust deduplication that preserves local variants of shared dispatches, and minimal text correction with Llama 3.2. It attributes wire content with high precision, retains multiple versions of each dispatch, and provides both raw and corrected texts; the datasets and code are planned for public release to support replication. The key contributions are a detailed end-to-end workflow for historical wire extraction, a deduplication strategy that respects variant reprints, and publicly available Newswire Classifiers with high F1 scores (AP 0.9925, UPI 0.9999, NEA 0.9876). The work enables computational historians and social scientists to analyze how wire news circulated and changed in the American South, supporting large-scale, replicable studies of mid-20th-century journalism.
Abstract
I describe a new pipeline for extracting wire-service articles (e.g., from the Associated Press, United Press International, and the Newspaper Enterprise Association) from newspaper images.
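
To make the idea of deduplication that preserves variant reprints concrete, the following is a minimal sketch and not the pipeline's actual implementation. It assumes character-shingle Jaccard similarity as the noise-tolerant measure and union-find for clustering; the function names and the 0.5 threshold are hypothetical choices made for illustration. Articles judged to be versions of the same dispatch are grouped under one cluster rather than discarded, so every local variant remains available.

```python
import re
from itertools import combinations


def shingles(text, k=5):
    """Return the set of character k-grams of a lowercased, punctuation-free text."""
    norm = re.sub(r"[^a-z0-9\s]", "", text.lower())
    norm = re.sub(r"\s+", " ", norm).strip()
    return {norm[i:i + k] for i in range(max(len(norm) - k + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_dispatches(articles, same_dispatch=0.5):
    """Cluster articles into dispatches without dropping any variant.

    `same_dispatch` is an assumed, illustrative threshold: pairs at or above it
    are treated as versions of one dispatch. Returns sorted lists of indices.
    """
    parent = list(range(len(articles)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [shingles(text) for text in articles]
    for i, j in combinations(range(len(articles)), 2):
        if jaccard(sets[i], sets[j]) >= same_dispatch:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(articles)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Character shingles are used here because a single OCR error disturbs only a few k-grams, whereas word-level matching would lose the whole token. The all-pairs comparison is quadratic and would need an approximate method such as locality-sensitive hashing at corpus scale.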
