Leveraging Digitized Newspapers to Collect Summarization Data in Low-Resource Languages

Noam Dahan, Omer Kidron, Gabriel Stanovsky

TL;DR

This paper addresses the scarcity of high-quality summarization data in low-resource languages by leveraging front-page teasers from digitized historical newspapers as naturally annotated summaries. It introduces a two-step data-collection approach to extract teaser-based (teaser, article) pairs and validates the method across seven languages, culminating in HebTeaseSum, a 7,774-sample Hebrew multi-document corpus built from a single title. The authors also develop an automatic teaser-article extraction pipeline, evaluating teaser identification and matching methods (TF-IDF, sentence-transformer, and zero-shot LLMs) and demonstrating the feasibility of large-scale data generation. The findings show that while LLMs can produce coherent summaries, coverage gaps persist, especially in lower-resource languages, underscoring the need for curated datasets and OCR-corrected data to enable robust evaluation and fine-tuning for multilingual summarization.

Abstract

High-quality summarization data remains scarce in under-represented languages. However, historical newspapers, made available through recent digitization efforts, offer an abundant source of untapped, naturally annotated data. In this work, we present a novel method for collecting naturally occurring summaries via Front-Page Teasers, where editors summarize full-length articles. We show that this phenomenon is common across seven diverse languages and supports multi-document summarization. To scale data collection, we develop an automatic process, suited to varying linguistic resource levels. Finally, we apply this process to a Hebrew newspaper title, producing HEBTEASESUM, the first dedicated multi-document summarization dataset in Hebrew.

Paper Structure

This paper contains 39 sections, 3 figures, and 9 tables.

Figures (3)

  • Figure 1: Newspapers' front-page teasers are a natural source of high-quality, expert-written summaries of full articles appearing inside the paper. We show that they are common across languages and lend themselves to straightforward data collection.
  • Figure 2: Front-Page Teasers are common in a variety of languages. Highlighted text shows the reference to relevant pages where the corresponding articles can be found, and serves as a useful signal for identifying summaries. From left to right: teasers from Rana Blad (Norway), Fréttablaðið (Iceland) and Stampa Sera (Italy). Links to newspapers and translations are in the appendix.
  • Figure 3: Our approach for extracting a summarization dataset from printed newspapers. We first find teasers and corresponding page numbers on the front page based on newspaper-specific keywords (e.g., "Full articles on Pages 8-9"). Then, we turn to these pages and find the articles most relevant to the content of the teaser, thus resulting in (teaser summary, relevant articles) pairs.
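
The two-step approach described in Figure 3 can be illustrated with a minimal sketch. This is not the authors' pipeline: the `PAGE_REF` keyword pattern, the function names, the hand-rolled TF-IDF weighting, and the `threshold` value are all assumptions chosen for illustration, and real newspapers would need title-specific keywords and OCR handling.

```python
import math
import re
from collections import Counter

# Hypothetical keyword pattern (an assumption): matches "Page 8" or "Pages 8-9".
PAGE_REF = re.compile(r"\b[Pp]ages?\s+(\d+)(?:\s*-\s*(\d+))?")


def find_teasers(front_page_blocks):
    """Step 1: keep front-page text blocks that contain a page reference."""
    teasers = []
    for block in front_page_blocks:
        match = PAGE_REF.search(block)
        if match:
            first = int(match.group(1))
            last = int(match.group(2) or first)
            teasers.append((block, list(range(first, last + 1))))
    return teasers


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(texts):
    """Build sparse TF-IDF vectors (dicts) with a smoothed IDF term."""
    token_lists = [tokenize(t) for t in texts]
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    n = len(texts)
    vectors = []
    for toks in token_lists:
        counts = Counter(toks)
        vectors.append({
            tok: count * (math.log((1 + n) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in counts.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_articles(teaser, pages, articles_by_page, threshold=0.1):
    """Step 2: rank articles on the referenced pages by similarity to the teaser."""
    candidates = [a for p in pages for a in articles_by_page.get(p, [])]
    if not candidates:
        return []
    vectors = tfidf_vectors([teaser] + candidates)
    scored = [(cosine(vectors[0], v), a) for v, a in zip(vectors[1:], candidates)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article for score, article in scored if score >= threshold]


# Toy usage with made-up text (not data from the paper).
front_page = [
    "Flood hits the northern valley, hundreds evacuated. Full articles on Pages 8-9",
    "Weather: sunny",
]
inner_pages = {
    8: ["Hundreds evacuated as flood hits the northern valley overnight",
        "Local football team wins the cup final"],
    9: ["Rescue crews work in the flooded northern valley"],
}
for teaser, pages in find_teasers(front_page):
    print(pages, match_articles(teaser, pages, inner_pages))
```

Because a teaser can point to a page range, the matching step may return several articles for one teaser, which is what makes the resulting pairs usable for multi-document summarization. The TF-IDF scorer here stands in for any of the matching methods mentioned in the TL;DR.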