Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages

Noam Dahan; Omer Kidron; Gabriel Stanovsky

TL;DR

This paper addresses the scarcity of high-quality summarization data in low-resource languages by treating front-page teasers from digitized historical newspapers as naturally annotated summaries. It introduces a two-step data-collection approach for extracting (teaser, article) pairs and validates the method across seven languages, culminating in HebTeaseSum, a 7,774-sample Hebrew multi-document corpus built from a single newspaper title. The authors also develop an automatic teaser-article extraction pipeline, evaluate teaser identification and matching methods (TF-IDF, sentence transformers, and zero-shot LLMs), and demonstrate the feasibility of large-scale data generation. The findings show that while LLMs can produce coherent summaries, coverage gaps persist, especially in lower-resource languages, underscoring the need for curated datasets and OCR-corrected data to enable robust evaluation and fine-tuning for multilingual summarization.

Abstract

High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HebTeaseSum, the first dedicated multi-document summarization dataset in Hebrew.
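
To make the teaser-article matching step concrete, the following is a minimal sketch of a TF-IDF baseline of the kind the TL;DR mentions. It is not the authors' pipeline: the function names (`tokenize`, `tfidf_vectors`, `cosine`, `match_teaser`), the whitespace tokenizer, the smoothed IDF formula, and the toy English sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization. A real pipeline would
    # likely need language-specific tokenization and OCR cleanup.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF, so terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_teaser(teaser, articles):
    """Return (index, score) of the article most similar to the teaser."""
    docs = [tokenize(teaser)] + [tokenize(a) for a in articles]
    vecs = tfidf_vectors(docs)
    scores = [cosine(vecs[0], v) for v in vecs[1:]]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy data, invented for illustration only.
articles = [
    "the city council approved the new budget after a long debate",
    "heavy rain flooded roads in the north and closed several schools",
    "the national team won the final match in extra time",
]
teaser = "council approved new budget"
index, score = match_teaser(teaser, articles)
```

Because a single teaser may summarize several articles in the multi-document setting, a variant of this sketch could return every article whose score exceeds a threshold instead of only the top match; any such threshold would be a further assumption to tune on held-out data.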
