
A dataset of Open Source Intelligence (OSINT) Tweets about the Russo-Ukrainian war

Johannes Niu, Mila Stillman, Philipp Seeberger, Anna Kruspe

TL;DR

This work addresses OSINT-focused discourse on Twitter about the Russo-Ukrainian war by building a targeted dataset with a two-step snowball sampling approach. The authors identify relevant OSINT accounts and collect their top-level Tweets from January 2022 to July 2023, yielding about 1.9 million Tweets from 1,040 users, including a substantial number of media items and external links. Initial analyses cover temporal trends, language distribution, hashtags, and embedded Tweets; initial experiments apply relevance classification and clustering to reveal topics and assess misinformation potential. The dataset complements broader war-related Twitter datasets and supports OSINT research on information diffusion and misinformation. The data is publicly available, and the authors outline directions for future enhancements as well as ethical considerations.

Abstract

Open Source Intelligence (OSINT) refers to intelligence efforts based on freely available data. It has become a frequent topic of conversation on social media, where private users or networks can share their findings. Such data is highly valuable in conflicts, both for gaining a new understanding of the situation and for tracking the spread of misinformation. In this paper, we present a method for collecting such data as well as a novel OSINT dataset for the Russo-Ukrainian war drawn from Twitter between January 2022 and July 2023. It is based on an initial search for users posting OSINT and a subsequent snowballing approach to detect more such users. The final dataset contains almost 2 million Tweets posted by 1,040 users. We also provide initial analyses and experiments on the data, and make suggestions for its future usage.
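
The seed-then-snowball idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `snowball_sample`, the `neighbours` callback (some way of finding accounts connected to a known account), and the `is_relevant` predicate (some check that an account posts OSINT content) are hypothetical placeholders, and the toy graph is invented purely for illustration.

```python
from typing import Callable, Iterable, Set


def snowball_sample(
    seeds: Iterable[str],
    neighbours: Callable[[str], Iterable[str]],  # hypothetical: accounts linked to a user
    is_relevant: Callable[[str], bool],          # hypothetical: does the account post OSINT?
    max_rounds: int = 2,
) -> Set[str]:
    """Grow a set of accounts from seed users, stopping at saturation."""
    selected = set(seeds)
    frontier = set(selected)
    for _ in range(max_rounds):
        # Gather every account reachable from the users added in the last round.
        candidates: Set[str] = set()
        for user in frontier:
            candidates.update(neighbours(user))
        # Keep only accounts that are new and pass the relevance check.
        new_users = {u for u in candidates - selected if is_relevant(u)}
        if not new_users:
            break  # saturation: this round found nothing new
        selected |= new_users
        frontier = new_users
    return selected


# Toy data (invented): a small account graph and a set of "relevant" accounts.
graph = {"a": ["b", "c"], "b": ["d"], "c": [], "d": ["e"], "e": []}
relevant = {"a", "b", "d", "e"}

result = snowball_sample(
    ["a"],
    neighbours=lambda u: graph.get(u, []),
    is_relevant=lambda u: u in relevant,
)
```

With two rounds, the toy run adds "b" and then "d" to the seed "a"; the account "c" is reachable but filtered out by the relevance check. In a real collection, both callbacks would be backed by platform queries and by manual or automated relevance screening.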

Paper Structure (18 sections, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Number of Tweets retrieved in the initial search on a monthly basis.
  • Figure 2: Snowball sampling approach with two iterations until reaching saturation for Tweets containing both "OSINT" and a country search term.
  • Figure 3: Number of collected Tweets in the dataset between January 2022 and July 2023 on a monthly basis.
  • Figure 4: Dataset statistics.
  • Figure 5: Language distribution in the dataset according to the language codes of the fasttext model. The 80 most frequent language codes are presented. Logarithmic scale was used for visualization purposes.
  • ...and 2 more figures