MiDe22: An Annotated Multi-Event Tweet Dataset for Misinformation Detection

Cagri Toraman, Oguzhan Ozcelik, Furkan Şahinuç, Fazli Can

TL;DR

MiDe22 is a bilingual, multi-event tweet dataset for misinformation detection, containing 5,284 English and 5,064 Turkish tweets about events from 2020 to 2022, together with four engagement types (likes, replies, retweets, and quotes) and media links. The authors provide human annotations (True/False/Other) with robust inter-annotator agreement, data analyses (quantitative, content, and temporal), and baseline benchmarks spanning BoW, neural, and transformer models. Transformer models outperform the other baselines, and the best-performing model differs by language (DeBERTa for English, XLM-R for Turkish), underscoring the value of multilingual and multimodal data for detecting misinformation. The work emphasizes transparency and outlines future directions including multimodal detection, adversarial testing, cross-lingual transfer, and cross-platform applicability, with the aim of supporting robust, real-world misinformation mitigation.

Abstract

The rapid dissemination of misinformation through online social networks is a pressing issue with harmful consequences for human health, public safety, democracy, and the economy; urgent action is therefore required to address it. In this study, we construct a new human-annotated dataset, called MiDe22, containing 5,284 English and 5,064 Turkish tweets with misinformation labels for several recent events between 2020 and 2022, including the Russia-Ukraine war, the COVID-19 pandemic, and Refugees. The dataset includes user engagements with the tweets in terms of likes, replies, retweets, and quotes. We also provide a detailed data analysis with descriptive statistics and the experimental results of a benchmark evaluation for misinformation detection.

Paper Structure

This paper contains 30 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: The topics (inner circle) and events (outer circle) in MiDe22 for English (left) and Turkish (right). Each area is proportional to the number of tweets it contains.
  • Figure 2: Word clouds of the most frequently observed keywords in the (a) English and (b) Turkish datasets for True, False, and Other. Collocations are calculated within a window of two consecutive words.
  • Figure 3: Temporal distribution of tweets by topic. The y-axis represents the density of tweet counts. The x-axis represents the date on which tweets were shared. The events EN03 and EN11 are omitted because of their shorter time ranges.