IsamasRed: A Public Dataset Tracking Reddit Discussions on Israel-Hamas Conflict

Kai Chen, Zihao He, Keith Burghardt, Jingxin Zhang, Kristina Lerman

TL;DR

IsamasRed is a large-scale Reddit dataset for studying public discourse on the Israel-Hamas conflict, assembled with a novel LLM-driven keyword extraction framework. It consists of roughly 400K conversations and over 8M comments from August to November 2023, plus two topic-focused subsets, IsamasRed-Z and IsamasRed-P. The work contributes an automated keyword extraction pipeline, the dataset itself, and a multi-faceted analysis of engagement, controversy, moral foundations, and emotions in online discourse. The dataset and methods provide a resource for studying ideology, sentiment, and community engagement in geopolitical discussions, with attention to ethical data handling and accessibility constraints.

Abstract

The conflict between Israel and Palestinians significantly escalated after the October 7, 2023 Hamas attack, capturing global attention. To understand the public discourse on this conflict, we present a meticulously compiled dataset, IsamasRed, comprising nearly 400,000 conversations and over 8 million comments from Reddit, spanning August 2023 to November 2023. We introduce an innovative keyword extraction framework that leverages a large language model to identify pertinent keywords, ensuring comprehensive data collection. Our initial analysis of the dataset, examining topics, controversy, and emotional and moral language trends over time, highlights the emotionally charged and complex nature of the discourse. This dataset aims to enrich the understanding of online discussions, shedding light on the complex interplay between ideology, sentiment, and community engagement in digital spaces.

Paper Structure

This paper contains 32 sections, 7 figures, and 2 tables.

Figures (7)

  • Figure 1: The framework of automated keyword extraction via LLMs. We first retrieve relevant Wikipedia pages using a small set of seed terms. Subsequently, we employ GPT-4 to filter out pages with weak relevance. Each page is then segmented into text chunks to fit the context window of the model. GPT-4 is used to identify keywords from the chunks and score them based on semantic relevance. Finally, the generic keywords not directly related to the topic of interest are filtered out.
  • Figure 2: Distribution of conversation lengths in IsamasRed.
  • Figure 3: Number of submissions and comments posted over time in IsamasRed, IsamasRed-Z, and IsamasRed-P.
  • Figure 4: Top 20 largest subreddits by the number of (a) submissions and (b) comments.
  • Figure 5: Popularity score and the number of unique authors of comments over time.
  • ...and 2 more figures
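
The chunk, score, and filter steps described in the Figure 1 caption can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the names `chunk_text`, `extract_keywords`, and `toy_scorer`, the parameters `max_words`, `min_score`, and `generic_terms`, and the threshold values are all assumptions made for this example. The GPT-4 call is replaced by a hypothetical stand-in scorer, and the Wikipedia retrieval and page-relevance filtering steps are omitted.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List


def chunk_text(text: str, max_words: int = 200) -> List[str]:
    """Split a page into word-bounded chunks so each fits a model's context window."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def extract_keywords(
    pages: Iterable[str],
    score_chunk: Callable[[str], Dict[str, float]],
    max_words: int = 200,
    min_score: float = 0.5,
    generic_terms: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Collect keywords from all chunks, keep each keyword's best score,
    then drop low-scoring and generic keywords."""
    best: Dict[str, float] = {}
    for page in pages:
        for chunk in chunk_text(page, max_words):
            # score_chunk stands in for the LLM call (assumption): it returns
            # candidate keywords mapped to a relevance score in [0, 1].
            for keyword, score in score_chunk(chunk).items():
                key = keyword.lower()
                best[key] = max(best.get(key, 0.0), score)
    return sorted(
        k for k, s in best.items() if s >= min_score and k not in generic_terms
    )


def toy_scorer(chunk: str) -> Dict[str, float]:
    """Hypothetical stand-in for an LLM scorer, using a fixed toy lexicon."""
    lexicon = {"hamas": 0.9, "gaza": 0.8, "war": 0.6, "people": 0.2}
    return {w: lexicon[w] for w in set(chunk.lower().split()) if w in lexicon}


if __name__ == "__main__":
    pages = ["Hamas launched an attack from Gaza", "The war affected many people"]
    print(extract_keywords(pages, toy_scorer, generic_terms=frozenset({"war"})))
```

Taking the scorer as a parameter keeps the chunking and filtering logic testable without any model access; a real scorer would wrap an LLM request and parse its response into the same keyword-to-score mapping.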