Collective moderation of hate, toxicity, and extremity in online discussions

Jana Lasser, Alina Herderich, Joshua Garland, Segun Taofeek Aroyehun, David Garcia, Mirta Galesic

TL;DR

This study addresses hate speech in online discourse using a large, four-year German Twitter dataset (130,127 discussion trees, ~1.15 million tweets) to examine how counter speech strategies affect discourse quality. The authors develop new classifiers for hate speech, argumentation strategy, and ingroup/outgroup content, and combine them with existing measures of toxicity and extremity. They analyze the data at three levels: matching at the micro level (individual reply pairs), and autoregressive distributed lag (ARDL) time-series models at the meso level (discussion trees) and macro level (days). They find that expressing simple opinions without insults is followed by the least hate, toxicity, and extremity. Sarcasm can also help, especially when organized extreme groups are present, while constructive interventions can reduce toxicity but may increase extremity. The work suggests that collective civic moderation can improve online spaces, and it offers practical guidance for citizens and organized groups engaging in counter speech, with implications for platform design and future cross-platform studies.

Abstract

In the digital age, hate speech poses a threat to the functioning of social media platforms as spaces for public discourse. Top-down approaches to moderate hate speech encounter difficulties due to conflicts with freedom of expression and issues of scalability. Counter speech, a form of collective moderation by citizens, has emerged as a potential remedy. Here, we aim to investigate which counter speech strategies are most effective in reducing the prevalence of hate, toxicity, and extremity on online platforms. We analyze more than 130,000 discussions on German Twitter starting at the peak of the migrant crisis in 2015 and extending over four years. We use human annotation and machine learning classifiers to identify argumentation strategies, ingroup and outgroup references, emotional tone, and different measures of discourse quality. Using matching and time-series analyses we discern the effectiveness of naturally observed counter speech strategies on the micro-level (individual tweet pairs), meso-level (entire discussions) and macro-level (over days). We find that expressing straightforward opinions, even if not factual but devoid of insults, results in the least subsequent hate, toxicity, and extremity over all levels of analyses. This strategy complements currently recommended counter speech strategies and is easy for citizens to engage in. Sarcasm can also be effective in improving discourse quality, especially in the presence of organized extreme groups. Going beyond one-shot analyses on smaller samples prevalent in most prior studies, our findings have implications for the successful management of public online spaces through collective civic moderation.

Paper Structure (18 sections, 1 equation, 3 figures)

Figures (3)

  • Figure 1: Measures of discourse quality and dimensions of discourse. (A) Normalized measures of discourse quality over time, with 0 representing their minimum and 1 their maximum value across the whole data set. Raw values are shown in SI Fig. S6. (B) Probability of different argumentation strategies over time. (C) Probability of different goals regarding ingroup/outgroup over time. (D) Probability of different emotional tones over time. Note. "Other" in argumentation strategies refers to ambiguous tweets or tweets that did not fall in any of the other categories. "Other" in Ingroup/outgroup content refers to tweets neutral with respect to group identity or tweets where a speaker's identity was not apparent. All measures are on a scale from 0 to 1. For hate speech, toxicity, argumentation strategies, ingroup/outgroup content, and emotional tone, higher values denote a higher probability that a human rater would perceive a tweet as hateful or toxic, or detect a certain strategy, ingroup/outgroup related goal or emotional tone in the tweet. For extremity of speech, higher values denote a higher classifier probability that a tweet is similar to extreme political speech exemplified either by the discourse of Reconquista Internet or of Reconquista Germanica. For the extremity of speakers, higher values denote a higher relative frequency of speakers whose tweets are labeled as containing extreme political speech. Error bands denote standard errors. All trends are smoothed over a two-week window. Thicker vertical lines denote several relevant events: mc1=beginning and mc2=peak of the migrant crisis, RG=start of Reconquista Germanica, el=2017 German elections, RI=start of Reconquista Internet. Additional details are provided in the SI Section S7.
  • Figure 2: Results of statistical models predicting changes in the probability of quality of discourse. We predicted different indicators of the quality of discourse following tweets characterized by different dimensions of discourse. Positive coefficients mean that a dimension of discourse is related to an increase in hate speech, toxicity, and extremity of speech and speakers, while negative coefficients indicate a decrease. The left panel (A) shows the micro-level effects on a subsequent tweet, obtained via matching analysis. The middle panel (B) shows the direct meso-level effects within discussion trees, calculated as meta-analytic estimates from ARDL models fitted on 3,569 discussion trees. The right panel (C) shows the direct macro-level effects from day to day, obtained from ARDL models fitted on averaged dimensions of discourse over each of 1,461 subsequent days. The icons of Reconquista Germanica (combined letters R and X resembling a sword) and Reconquista Internet (a sign that resembles a heart) denote the direction of reliable interactions with the percentage of extreme speakers resembling one of the groups in each tree (panel B) and with the existence of one or both groups in the public sphere on a specific day (panel C). If an effect of a dimension became more negative (positive) when one or both of these groups were present, we add the respective icon to the left (right) side of the effect. Additional results for lagged effects of discourse dimensions, robustness checks, and tables with all results, are provided in SI Sections S10-S12.
  • Figure 3: Overview of the study. Blue shading: We developed new classifiers to extract dimensions of discourse Argumentation strategy and Ingroup/outgroup content, as well as Hate speech, a measure of discourse quality. Orange shading: Where feasible, we applied pre-existing classifiers to detect discourse dimension Emotional tone and derive other measures of discourse quality - Toxicity and Extremity. We analyzed the relationship between dimensions of discourse and discourse quality on three different levels: 1) the micro level of individual reply pairs (numbers are examples for a tweet containing hate speech as measured by our classifier); 2) the meso level in the remainder of a discussion tree; and 3) the macro level over entire days. Notes: *Column "Variables" lists the classes extracted by the classifiers that were used as predictors in the statistical analyses on all three levels. **Ingroup/outgroup content was extracted with two classifiers in conjunction: classifier GROUP identified whether in- and/or outgroup content was present at all, while classifier GOAL identified the socio-psychological goal of a tweet. Details are provided in the Methods.
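
The Figure 1 caption describes two preprocessing steps for the plotted trends: rescaling each measure so that 0 is its minimum and 1 its maximum across the data set, and smoothing over a two-week window. The sketch below illustrates both steps under stated assumptions. The caption does not say whether the window is trailing or centered, nor in which order the steps are applied, so the trailing window and the normalize-then-smooth order here are assumptions. The function names and the sample values are hypothetical and are not taken from the paper.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Rescale values so the minimum maps to 0 and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series has no spread; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rolling_mean(values: List[float], window: int = 14) -> List[float]:
    """Trailing moving average (assumed window type).

    Early points average over however many observations are available so far.
    """
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the observation that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Hypothetical daily averages of one discourse-quality measure (illustrative only).
daily_toxicity = [0.20, 0.25, 0.40, 0.35, 0.30]
normalized = min_max_normalize(daily_toxicity)
# A short window is used here only because the toy series has five points;
# a two-week window on daily data would use window=14.
smoothed = rolling_mean(normalized, window=3)
```

With daily data, `window=14` corresponds to the two-week window named in the caption. A centered window would shift the smoothed curve relative to the event markers, so the choice matters when reading trends around the marked dates.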