A Multi-faceted Semi-Synthetic Dataset for Automated Cyberbullying Detection
Naveed Ejaz, Fakhra Kashif, Salimur Choudhury
TL;DR
The paper addresses inconsistent cyberbullying definitions and datasets by proposing a multi-faceted semi-synthetic dataset that encodes aggression, repetition, peer relations, and intent to harm. It details a generation pipeline that creates 100 synthetic users with attributes, a peerness matrix, and inter-user messages drawn from aggressive and non-aggressive corpora, and that computes an IntentToHarm score. Labels are assigned by thresholding on peerness, repetition, and intent, producing a public dataset of six CSV files with 1,021 cyberbullying conversations among 9,512 distinct user pairs. The contribution supports reproducibility and provides a scalable framework for evaluating cyberbullying detection approaches, with plans to broaden demographics and refine the thresholding to enable multi-class classification.
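The thresholding step described above can be illustrated with a minimal sketch. The threshold values, field names, and score ranges below are assumptions made for illustration only; they are not taken from the paper or from the published CSV files.

```python
from dataclasses import dataclass

# Hypothetical thresholds, chosen only for illustration.
PEERNESS_MIN = 0.5      # assumed minimum peerness score for a user pair
REPETITION_MIN = 2      # assumed minimum number of aggressive messages
INTENT_MIN = 0.5        # assumed minimum IntentToHarm score


@dataclass
class Conversation:
    """One directed user pair; all field names are hypothetical."""
    peerness: float         # peerness score of the pair, assumed in [0, 1]
    aggressive_count: int   # aggressive messages sent from one user to the other
    intent_to_harm: float   # IntentToHarm score, assumed in [0, 1]


def is_cyberbullying(c: Conversation) -> bool:
    """Label a conversation as cyberbullying only if every facet passes its threshold."""
    return (
        c.peerness >= PEERNESS_MIN
        and c.aggressive_count >= REPETITION_MIN
        and c.intent_to_harm >= INTENT_MIN
    )


examples = [
    Conversation(peerness=0.8, aggressive_count=3, intent_to_harm=0.9),  # all facets present
    Conversation(peerness=0.8, aggressive_count=1, intent_to_harm=0.9),  # no repetition
    Conversation(peerness=0.2, aggressive_count=4, intent_to_harm=0.7),  # not peers
]
labels = [is_cyberbullying(c) for c in examples]
```

Requiring all conditions at once reflects the view that aggression alone is not sufficient: a conversation that lacks repetition, a peer relationship, or intent to harm is not labeled as cyberbullying under this sketch.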
Abstract
In recent years, the rising use of social media has made automated cyberbullying detection a prominent research domain. However, challenges persist because there is no standardized definition of cyberbullying and no universally accepted dataset. Many researchers now view cyberbullying as a form of cyberaggression that, in addition to online aggression, involves repetition, peer relationships, and harmful intent. Acquiring data from social media networks that reflects all of these components is a complex task. This paper describes an extensive semi-synthetic cyberbullying dataset that incorporates all of the essential aspects of cyberbullying: aggression, repetition, peer relationships, and intent to harm. It briefly outlines the method used to create the dataset and gives a detailed overview of the publicly accessible data. This accompanying data article provides an in-depth look at the dataset, increasing transparency and enabling replication. It also supports a deeper understanding of the data and its broader use in research.
