Large Scale Crowdsourcing and Characterization of Twitter Abusive Behavior

Antigoni-Maria Founta, Constantinos Djouvas, Despoina Chatzakou, Ilias Leontiadis, Jeremy Blackburn, Gianluca Stringhini, Athena Vakali, Michael Sirivianos, Nicolas Kourtellis

TL;DR

The paper tackles the difficulty of creating ground-truth labels for abusive behavior on Twitter by introducing an iterative crowdsourcing framework that refines label definitions and uses boosted sampling to balance rare abuse signals. It progresses from data collection and exploratory rounds to a large-scale annotation, culminating in an 80k-tweet labeled corpus with four practical categories (Abusive, Hateful, Normal, Spam) and a robust annotation platform. Key findings include strong inter-annotator agreement under the final schema, the insight that Cyberbullying is rarely useful in this context, and the value of boosted sampling for capturing minority classes. The work provides a replicable methodology, open-source tooling, and a valuable resource for researchers building abuse-detection systems and conducting large-scale crowdsourced labeling on social media data.

Abstract

In recent years, offensive, abusive and hateful language, sexism, racism and other types of aggressive and cyberbullying behavior have been manifesting with increased frequency on many online social media platforms. Past scientific work has focused on studying these forms in popular media, such as Facebook and Twitter. Building on such work, we present an 8-month study of the various forms of abusive behavior on Twitter, in a holistic fashion. Departing from past work, we examine a wide variety of labeling schemes, which cover different forms of abusive behavior, at the same time. We propose an incremental and iterative methodology that utilizes the power of crowdsourcing to annotate a large-scale collection of tweets with a set of abuse-related labels. By applying our methodology, including statistical analysis for label merging or elimination, we identify a reduced but robust set of labels. Finally, we offer a first overview and findings of our collected and annotated dataset of 100 thousand tweets, which we make publicly available for further scientific exploration.

Paper Structure

This paper contains 31 sections, 4 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Data Preparation Pipeline (Step 1). Pre-filtering and spam removal to clean tweets. ($A$) random set of un-boosted tweets. ($B$) boosted sampling to produce a set of tweets biased towards abusive behavior. Sub-datasets $D1$ and $D2$ are used in the subsequent Steps 2 and 3.
  • Figure 2: Exploratory Analysis (Step 2). Dataset $D1$ is input to the platform for annotation under label set $L$ over consecutive rounds. In each round, the statistical analysis performed can narrow the label set down to $L'$. The final label set $L''$ can then be input to Step 3.
  • Figure 3: Final Annotation Round (Step 3). A larger dataset $D2$, with the final label set $L''$, can be used for large-scale annotation. The custom-built platform allows better control of the annotation flow and reduces dependence on CrowdFlower-specific design limitations.
  • Figure 4: Distributions of judgments per inappropriate label for the two exploratory rounds in Step 2.
  • Figure 5: Categories of majority distributions for all preliminary rounds in Step 2.
  • ...and 4 more figures
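
The split in Figure 1 between a random set ($A$) and a boosted set ($B$) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the scoring heuristic (`lexicon_score`), the word list (`FLAGGED_WORDS`), the threshold, and the function name `boosted_sample` are all hypothetical, chosen only to show how a sample can be biased towards rare abusive content while keeping an unbiased random portion.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical word list, for illustration only; not taken from the paper.
FLAGGED_WORDS = {"idiot", "stupid", "hate"}


def lexicon_score(text: str) -> float:
    """Fraction of whitespace-separated tokens found in FLAGGED_WORDS."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(t.strip(".,!?") in FLAGGED_WORDS for t in tokens)
    return hits / len(tokens)


def boosted_sample(
    tweets: Sequence[str],
    score_fn: Callable[[str], float],
    n_random: int,
    n_boosted: int,
    threshold: float = 0.1,  # assumed cut-off, not from the paper
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (random_part, boosted_part).

    random_part is drawn uniformly from all tweets (set A).
    boosted_part is drawn only from the remaining tweets whose score
    reaches the threshold (set B), so it is biased towards likely abuse.
    """
    rng = random.Random(seed)
    random_part = rng.sample(list(tweets), min(n_random, len(tweets)))
    already_chosen = set(random_part)
    candidates = [
        t for t in tweets
        if t not in already_chosen and score_fn(t) >= threshold
    ]
    boosted_part = rng.sample(candidates, min(n_boosted, len(candidates)))
    return random_part, boosted_part
```

Keeping the two parts separate, rather than merging them immediately, makes it possible to compare label distributions between the random and the boosted portions later on.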