The 2021 Tokyo Olympics Multilingual News Article Dataset

Erik Novak, Erik Calcina, Dunja Mladenić, Marko Grobelnik

Abstract

In this paper, we introduce a dataset of multilingual news articles covering the 2021 Tokyo Olympics. A total of 10,940 news articles were gathered from 1,918 different publishers, covering 1,350 sub-events of the 2021 Olympics, and published between July 1, 2021, and August 14, 2021. These articles are written in nine languages from different language families and in different scripts. To create the dataset, the raw news articles were first retrieved via a service that collects and analyzes news articles. Then, the articles were grouped using an online clustering algorithm, with each group containing articles reporting on the same sub-event. Finally, the groups were manually annotated and evaluated. The dataset is intended as a resource for evaluating multilingual news clustering algorithms, for which few datasets are available. It can also be used to analyze the dynamics and events of the 2021 Tokyo Olympics from different perspectives. The dataset is available in CSV format and can be accessed from the CLARIN.SI repository.

Paper Structure

This paper contains 19 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: The schematic overview of the OG2021 development.
  • Figure 2: The OG2021 article distribution by date. The majority of articles were published between the official start of the Olympic Games (July 23, 2021) and the official end of the Olympic Games (August 8, 2021).
  • Figure 3: The OG2021 article distribution by size. Globally, about 95% of clusters contain 25 or fewer articles.
  • Figure 4: The OG2021 cluster distribution per language. Almost 28% of the clusters contain two or more languages. The lower graph shows the distribution of monolingual clusters across languages.
  • Figure 5: The OG2021 language co-occurrence in clusters. All language pairs appear together in at least one cluster.
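
Cluster-level statistics such as those quoted in the captions of Figures 3 and 4 (the share of clusters with 25 or fewer articles, and the share of clusters containing two or more languages) could be recomputed from the CSV release along the following lines. This is only a minimal sketch: the column names `cluster_id` and `lang`, and the helper names `load_rows` and `cluster_stats`, are assumptions made for illustration and are not taken from the dataset documentation, so they would need to be adapted to the actual file.

```python
import csv
from collections import defaultdict


def load_rows(path):
    """Yield one dict per article from a CSV file (hypothetical helper)."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)


def cluster_stats(rows, cluster_col="cluster_id", lang_col="lang", size_cap=25):
    """Return (share of multilingual clusters, share of clusters with <= size_cap articles).

    `cluster_col` and `lang_col` are assumed column names, not confirmed ones.
    """
    sizes = defaultdict(int)   # cluster id -> number of articles
    langs = defaultdict(set)   # cluster id -> set of languages seen
    for row in rows:
        cid = row[cluster_col]
        sizes[cid] += 1
        langs[cid].add(row[lang_col])

    n_clusters = len(sizes)
    if n_clusters == 0:
        raise ValueError("no articles were provided")

    multilingual = sum(1 for s in langs.values() if len(s) >= 2) / n_clusters
    small = sum(1 for c in sizes.values() if c <= size_cap) / n_clusters
    return multilingual, small


# Toy input standing in for the real file: three clusters, one of them bilingual.
toy_rows = [
    {"cluster_id": "a", "lang": "en"},
    {"cluster_id": "a", "lang": "de"},
    {"cluster_id": "b", "lang": "sl"},
    {"cluster_id": "c", "lang": "en"},
    {"cluster_id": "c", "lang": "en"},
]
toy_multilingual, toy_small = cluster_stats(toy_rows)
```

With the real file, the same function would be called as `cluster_stats(load_rows(path))`, where `path` points to the downloaded CSV.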