ExtremeBB: A Database for Large-Scale Research into Online Hate, Harassment, the Manosphere and Extremism

Anh V. Vu, Lydia Wilson, Yi Ting Chua, Ilia Shumailov, Ross Anderson

TL;DR

ExtremeBB is a scalable, ethically shared textual database of over 53.5 million posts by 38.5 thousand users across 12 extremist forums, supporting research into online hate, harassment, the manosphere, and extremism over two decades. A dynamic crawler continuously collects data, which is stored as PostgreSQL snapshots, with metadata such as posting times, join dates, and reputations; reported analyses include post-length distributions, toxicity profiles, and cross-forum user overlaps. Access is governed by a formal licensing regime that requires ethics approval and public-domain release of results; since 2020 it has covered 49 licensees in 16 research groups from 12 institutions. The database enables near real-time monitoring and assessment of interventions while safeguarding researchers. Limitations include partial capture of forums, the absence of multimedia, a focus on English-language content, and privacy and legal considerations, which are mitigated by encryption and aggregated analyses.
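
As a rough illustration of two of the analyses named above (post-length distributions and cross-forum user overlaps), the following minimal sketch works on a handful of in-memory rows. The row layout `(forum, username, post_text)`, the forum names, and the function names are hypothetical and are not ExtremeBB's actual schema; matching users across forums by identical username is likewise only an assumption made for this example.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean

# Hypothetical rows: (forum, username, post_text). This is NOT the real
# ExtremeBB schema; it only mimics the kind of data the summary describes.
posts = [
    ("forum_a", "alice", "short post"),
    ("forum_a", "bob", "a somewhat longer post here"),
    ("forum_b", "alice", "hello"),
    ("forum_b", "carol", "another post"),
]


def mean_post_length(rows):
    """Return the mean post length (in characters) for each forum."""
    lengths = defaultdict(list)
    for forum, _user, text in rows:
        lengths[forum].append(len(text))
    return {forum: mean(values) for forum, values in lengths.items()}


def user_overlap(rows):
    """Count usernames shared by each pair of forums.

    Assumption: an identical username on two forums is treated as the
    same user, which is a simplification.
    """
    users = defaultdict(set)
    for forum, user, _text in rows:
        users[forum].add(user)
    return {
        (a, b): len(users[a] & users[b])
        for a, b in combinations(sorted(users), 2)
    }


if __name__ == "__main__":
    print(mean_post_length(posts))
    print(user_overlap(posts))
```

In practice the rows would come from a query against a licensed database snapshot rather than a literal list, and results would be reported in aggregate, in line with the privacy considerations noted above.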

Abstract

We introduce ExtremeBB, a textual database of over 53.5M posts made by 38.5k users on 12 extremist bulletin board forums promoting online hate, harassment, the manosphere and other forms of extremism. It enables large-scale analyses of qualitative and quantitative historical trends going back two decades: measuring hate speech and toxicity; tracing the evolution of different strands of extremist ideology; tracking the relationships between online subcultures, extremist behaviours, and real-world violence; and monitoring extremist communities in near real time. This can shed light not only on the spread of problematic ideologies but also the effectiveness of interventions. ExtremeBB comes with a robust ethical data-sharing regime that allows us to share data with academics worldwide. Since 2020, access has been granted to 49 licensees in 16 research groups from 12 institutions.

Paper Structure

This paper contains 3 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Post length distribution of each forum. Red dots are means. Abbreviations are shown in Table \ref{tab:database-taxonomy}.
  • Figure 3: The toxicity level of posts. The purple areas: probability density; TX: toxicity, ST: severe toxicity, IA: identity attack, IS: insult, PF: profanity, TH: threat. Red dots are means. A large proportion of posts are not toxic.