A diverse Multilingual News Headlines Dataset from around the World

Felix Leeb; Bernhard Schölkopf

TL;DR

Grouping articles into same-event clusters with a simple TF-IDF-weighted similarity metric reveals intuitive patterns that depend on the proximity and unexpectedness of the event.

Abstract

Babel Briefings is a novel dataset featuring 4.7 million news headlines from August 2020 to November 2021, across 30 languages and 54 locations worldwide, with English translations of all articles included. Designed for natural language processing and media studies, it serves as a high-quality dataset for training or evaluating language models and offers a simple, accessible collection of articles for analyzing, for example, global news coverage and cultural narratives. As a simple demonstration of the analyses this dataset facilitates, we group articles into clusters about the same event using a basic TF-IDF-weighted similarity metric. We then visualize the event signature of each cluster, showing which languages' articles appear over time, which reveals intuitive features based on the proximity and unexpectedness of the event. The dataset is available on Kaggle (https://www.kaggle.com/datasets/felixludos/babel-briefings) and HuggingFace (https://huggingface.co/datasets/felixludos/babel-briefings), with accompanying code on GitHub (https://github.com/felixludos/babel-briefings).
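The clustering step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tokenizer, the smoothed IDF formula, the greedy single-pass assignment, the `threshold` value, and the example titles are all assumptions chosen to keep the example short and dependency-free.

```python
import math
import re
from collections import Counter


def tokenize(title):
    # Naive lowercase alphanumeric tokenizer (an assumption, not the paper's).
    return re.findall(r"[a-z0-9]+", title.lower())


def tfidf_vectors(titles):
    # Build one sparse, L2-normalised TF-IDF vector (a dict) per title.
    docs = [Counter(tokenize(t)) for t in titles]
    df = Counter(tok for doc in docs for tok in doc)  # document frequency
    n = len(docs)
    vectors = []
    for doc in docs:
        # Smoothed IDF; the exact weighting is an illustrative choice.
        vec = {
            tok: tf * (math.log((1 + n) / (1 + df[tok])) + 1.0)
            for tok, tf in doc.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({tok: w / norm for tok, w in vec.items()})
    return vectors


def cosine(a, b):
    # Dot product of two unit-length sparse vectors.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def cluster_titles(titles, threshold=0.3):
    # Greedy single pass: join the first cluster whose seed title is
    # similar enough, otherwise start a new cluster.
    vectors = tfidf_vectors(titles)
    seeds = []   # index of the seed title of each cluster
    labels = []  # cluster label of each title
    for i, vec in enumerate(vectors):
        for label, seed in enumerate(seeds):
            if cosine(vec, vectors[seed]) >= threshold:
                labels.append(label)
                break
        else:
            seeds.append(i)
            labels.append(len(seeds) - 1)
    return labels


# Toy English titles invented for illustration; not taken from the dataset.
titles = [
    "Diego Maradona dies at 60",
    "Football legend Diego Maradona dies",
    "Super Bowl 2021 kickoff time announced",
    "Super Bowl 2021 halftime show preview",
]
labels = cluster_titles(titles)
```

With these toy titles, the two Maradona headlines share one label and the two Super Bowl headlines share another. On millions of headlines, a pairwise greedy pass like this would likely be too slow, so a real pipeline would probably need an index or a library-backed sparse matrix.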

Paper Structure (9 sections, 2 equations, 5 figures, 2 tables)

Introduction
Related Work
Dataset
Collection
Structure
Statistics
Analysis
Conclusion
News Headline Dataset Comparison

Figures (5)

Figure 1: Streamplot showing how many articles appear for some of the most popular events in the dataset when articles are clustered by title, with the most common tokens for each cluster shown in the legend. Note the qualitative similarity between the news coverage of these events over time and the memes of Leskovec et al. (2009), demonstrating the potential of this dataset for studying the evolution of major news coverage over time across the world.
Figure 2: Articles reporting on riots in Washington DC on 6 January 2021. Note how the event is reported in many different languages, but the majority of articles are in English. Additionally, there are several subsequent smaller spikes corresponding to related events, such as the beginning of the formal investigation into the riots.
Figure 3: Articles reporting on Diego Maradona's death on 25 November 2020 (and his declining health in the weeks before). Note how, after a few weeks, only Spanish articles about the topic continue to appear, underscoring the relative importance of the event in Spanish-speaking countries.
Figure 4: Articles reporting on the Super Bowl on 7 February 2021. Note how, unlike unexpected events (such as in Figure 2), there is a considerable lead-up to the event before the peak, showing the media's anticipation of the event.
Figure 5: Articles reporting on a crisis between Israel and Gaza on 10 May 2021. Note the prolonged spike for the duration of the crisis, as well as the significant number of articles in Arabic and Hebrew.
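The event signatures in Figures 2 to 5 show article counts per language over time. A rough sketch of how such counts could be tallied for one cluster follows; the record keys `language` and `day` are hypothetical stand-ins rather than the dataset's actual field names, and the records are invented for illustration.

```python
from collections import Counter, defaultdict
from datetime import date


def event_signature(articles):
    # Map each language to a Counter of {publication day: article count}.
    signature = defaultdict(Counter)
    for article in articles:
        signature[article["language"]][article["day"]] += 1
    return signature


def peak_day(signature):
    # Day with the most articles summed over all languages.
    totals = Counter()
    for counts in signature.values():
        totals.update(counts)
    return max(totals, key=totals.get)


# Toy cluster with hypothetical keys; not real dataset records.
cluster = [
    {"language": "en", "day": date(2021, 1, 6)},
    {"language": "en", "day": date(2021, 1, 6)},
    {"language": "es", "day": date(2021, 1, 6)},
    {"language": "de", "day": date(2021, 1, 7)},
    {"language": "en", "day": date(2021, 1, 7)},
]
signature = event_signature(cluster)
peak = peak_day(signature)
```

Plotting each language's counter as a stacked series over the days would give a picture in the spirit of the figures above. An unexpected event would show little mass before `peak`, whereas an anticipated one would show a gradual build-up.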
