Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages
Noam Dahan, Omer Kidron, Gabriel Stanovsky
TL;DR
This paper addresses the scarcity of high-quality summarization data in low-resource languages by treating front-page teasers from digitized historical newspapers as naturally annotated summaries. It introduces a two-step data-collection approach for extracting (teaser, article) pairs and validates the method across seven languages, culminating in HebTeaseSum, a 7,774-sample Hebrew multi-document corpus built from a single newspaper title. The authors also develop an automatic teaser-article extraction pipeline, evaluate teaser identification and matching methods (TF-IDF, sentence transformers, and zero-shot LLMs), and demonstrate the feasibility of large-scale data generation. The findings show that while LLMs can produce coherent summaries, coverage gaps persist, especially in lower-resource languages, underscoring the need for curated datasets and OCR-corrected data to enable robust evaluation and fine-tuning for multilingual summarization.
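As an illustration of the TF-IDF matching baseline mentioned above, the following is a minimal sketch that scores one teaser against candidate articles by cosine similarity over TF-IDF vectors. The function names, the tokenizer, the smoothed IDF variant, and the example texts are all assumptions made for illustration; the summary does not specify the authors' configuration, so this should not be read as their implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Naive lowercase word tokenizer (an assumption; real OCR text needs more care).
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs):
    # Build sparse TF-IDF vectors (dicts) with a smoothed IDF over the given documents.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}
    return [
        {tok: tf * idf[tok] for tok, tf in Counter(toks).items()}
        for toks in tokenized
    ]


def cosine(u, v):
    # Cosine similarity between two sparse vectors; 0.0 if either is empty.
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_teaser(teaser, articles):
    # Return the index of the best-scoring article and all similarity scores.
    if not articles:
        raise ValueError("at least one candidate article is required")
    vectors = tfidf_vectors([teaser] + articles)
    scores = [cosine(vectors[0], v) for v in vectors[1:]]
    best = max(range(len(articles)), key=scores.__getitem__)
    return best, scores


# Hypothetical example texts, not taken from the dataset.
teaser = "Parliament approves new water budget after long debate"
articles = [
    "The football team won the championship final yesterday",
    "After a long debate, parliament voted to approve the new water budget for the coming year",
    "Heavy rain is expected across the north this weekend",
]
best, scores = match_teaser(teaser, articles)
print(best, [round(s, 3) for s in scores])
```

Because a teaser may summarize several articles in the multi-document setting, a variant could keep every article whose score exceeds a threshold instead of only the top match; any such threshold would need to be tuned on annotated pairs.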
Abstract
High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, in which editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process suited to varying levels of linguistic resources. Finally, we apply this process to a Hebrew newspaper title, producing HebTeaseSum, the first dedicated multi-document summarization dataset in Hebrew.
