Hostile Counterspeech Drives Users From Hate Subreddits

Daniel Hickey; Matheus Schmitz; Daniel M. T. Fessler; Paul E. Smaldino; Kristina Lerman; Goran Murić; Keith Burghardt

TL;DR

This study investigates whether counterspeech affects participation in online hate communities on Reddit and whether certain counterspeech tactics are more effective than others. It builds a specialized counterspeech detection model using a newly annotated dataset from 25 hate subreddits, augmented with karma and subreddit context, and uses Mahalanobis matching to causally compare newcomers who receive hostile counterspeech, non-hostile counterspeech, or in-group replies. The results show that hostile counterspeech substantially reduces newcomer engagement in hate subreddits (engagement risk ratio, ERR ≈ 0.88), while non-hostile counterspeech has little effect on engagement; general Reddit retention remains largely unaffected. The work provides a publicly available dataset and code, and it highlights ethical considerations and the need for nuanced counterspeech strategies that reduce hate online while mitigating harms.
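
As a rough illustration of the Mahalanobis matching named above, the sketch below pairs each "treated" newcomer (one who received a reply) with the closest not-yet-used control newcomer, where closeness is the Mahalanobis distance over two covariates. The two covariates, the greedy one-to-one strategy, and the function names are assumptions made for this example; the summary does not specify the covariates or the exact matching procedure used in the paper.

```python
from statistics import mean


def covariance_2d(rows):
    """Sample covariance of (x, y) pairs, returned as (sxx, sxy, syy)."""
    mx = mean(r[0] for r in rows)
    my = mean(r[1] for r in rows)
    n = len(rows) - 1
    sxx = sum((r[0] - mx) ** 2 for r in rows) / n
    syy = sum((r[1] - my) ** 2 for r in rows) / n
    sxy = sum((r[0] - mx) * (r[1] - my) for r in rows) / n
    return sxx, sxy, syy


def mahalanobis_sq(a, b, cov):
    """Squared Mahalanobis distance d^T S^-1 d for two covariates."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = a[0] - b[0], a[1] - b[1]
    # S^-1 = (1 / det) * [[syy, -sxy], [-sxy, sxx]]
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det


def match_pairs(treated, controls):
    """Greedy 1:1 nearest-neighbour matching without replacement."""
    cov = covariance_2d(treated + controls)  # pooled covariance
    available = set(range(len(controls)))
    pairs = []
    for t_idx, t in enumerate(treated):
        if not available:
            break
        best = min(available, key=lambda c: mahalanobis_sq(t, controls[c], cov))
        pairs.append((t_idx, best))
        available.remove(best)
    return pairs


# Hypothetical covariates: (prior comment count, account age in days).
treated = [(3.0, 10.0), (20.0, 400.0)]
controls = [(19.0, 390.0), (2.0, 12.0), (50.0, 900.0)]
pairs = match_pairs(treated, controls)
print(pairs)  # [(0, 1), (1, 0)]
```

Greedy matching is order-dependent; an optimal assignment or a distance caliper would be a reasonable alternative when the control pool is small.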

Abstract

Counterspeech -- speech that opposes hate speech -- has gained significant attention recently as a strategy to reduce hate on social media. While previous studies suggest that counterspeech can somewhat reduce hate speech, little is known about its effects on participation in online hate communities, or about which counterspeech tactics reduce harmful behavior. We begin to address these gaps by identifying 25 large hate communities ("subreddits") within Reddit and analyzing the effect of counterspeech on newcomers within these communities. We first construct a new public dataset of carefully annotated counterspeech and non-counterspeech comments within these subreddits. We use this dataset to train a state-of-the-art counterspeech detection model. Next, we use matching to evaluate the causal effects of hostile and non-hostile counterspeech on the engagement of newcomers in hate subreddits. We find that, while non-hostile counterspeech is ineffective at keeping users from fully disengaging from these hate subreddits, a single hostile counterspeech comment substantially reduces the future likelihood of engagement. While offering nuance to the understanding of counterspeech efficacy, these results a) leave unanswered the question of whether hostile counterspeech dissuades newcomers from participation in online hate writ large, or merely drives them into less-moderated and more extreme hate communities, and b) raise ethical considerations about hostile counterspeech, which is both comparatively common and might exacerbate rather than mitigate the net level of antagonism in society. These findings underscore the importance of future work to improve counterspeech tactics and minimize unintended harm.
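
The change in engagement is summarized as an engagement risk ratio: the probability that a newcomer posts again in the subreddit after receiving a reply, divided by the same probability for matched newcomers who received no reply. The sketch below computes that ratio from two lists of return indicators. The function name and the toy data are invented for illustration and are not taken from the paper.

```python
def engagement_risk_ratio(treated_returned, control_returned):
    """P(posts again | reply) / P(posts again | no reply)."""
    if not treated_returned or not control_returned:
        raise ValueError("both groups must be non-empty")
    p_treated = sum(treated_returned) / len(treated_returned)
    p_control = sum(control_returned) / len(control_returned)
    if p_control == 0:
        raise ValueError("control return rate is zero; ratio is undefined")
    return p_treated / p_control


# Toy data (not from the paper): 1 = newcomer posted again, 0 = did not.
treated = [1, 0, 0, 1, 0, 1, 0, 0]  # 3 of 8 returned
control = [1, 1, 0, 1, 0, 1, 0, 0]  # 4 of 8 returned
err = engagement_risk_ratio(treated, control)
print(err)  # 0.75
```

A value below one means the replied-to newcomers returned less often than their matched counterparts; a value above one means they returned more often.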

Paper Structure (2 sections, 1 equation, 11 figures, 4 tables)

This paper contains 2 sections, 1 equation, 11 figures, 4 tables.

Results for all types of newcomer interactions
Additional Robustness Checks

Figures (11)

Figure 1: Schematic of the model and counterspeech experiment. (Top left panel) Reddit comments are first collected from hate subreddits, with counterspeech upsampled based on a model that predicts if text contains counterspeech. These data are sent to annotators who assess whether the text contains counterspeech, discuss disagreements in annotations, and then annotate again for a total of 900 annotated comments. (Top right panel) These data are embedded with RoBERTa and used to train a neural network whose second hidden layer is concatenated with the subreddit and karma (up- and downvote) information of that comment. (Bottom left panel) We then create a matched-pair causal model to determine the effect of hostile and non-hostile counterspeech, as well as the effect of in-group speech within both hate subreddits and counterpart non-hate subreddits. (Bottom right panel) For each condition, we compare users who received a reply to similar users who did not. Finally, the effect of the reply is the ratio between the probability a user continues to post in the subreddit when they receive a reply versus when they do not.
Figure 2: (A) Performance of models for detecting counterspeech using different training data. F1 scores are chosen based on the threshold that maximizes the F1 score. (B) Dataset sizes for the Yu et al. (2022) data and the hate community data.
Figure 3: Hostile counterspeech replies significantly reduce engagement in hate subreddits. We plot the distributions of engagement risk ratios for four interaction types: non-hate subreddit users replying to each other, in-group hate subreddit users replying to each other, non-hostile counterspeech replies to in-group users, and hostile counterspeech replies to in-group users. Points represent engagement risk ratios of individual subreddits, while boxplots summarize the overall distributions for each subreddit type. Values of engagement risk ratios greater than one imply that replies are associated with more active users, while values less than one imply that replies are associated with less active users. The lines in the boxes represent medians, while the boxes and outer lines represent the interquartile range and 95% quantiles, respectively.
Figure 4: Survival curves of general Reddit retention for each interaction type.
Figure 5: Reply attributes of all interactions. (a) Attacks on commenters, (b) mean toxicity, and (c) mean sentiment.
...and 6 more figures
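
The Figure 2 caption reports F1 at the threshold that maximizes it. A minimal sketch of that selection step follows, assuming the classifier outputs one score per comment and that every observed score is tried as a candidate threshold. The function names and the toy scores are hypothetical. The summary does not say which data split the threshold was chosen on; choosing it on the evaluation data itself would give an optimistic score.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when comments scoring >= threshold are predicted as counterspeech."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_f1_threshold(scores, labels):
    """Return (best F1, threshold); ties go to the higher threshold."""
    candidates = sorted(set(scores))
    return max((f1_at_threshold(scores, labels, t), t) for t in candidates)


# Toy scores and labels (1 = counterspeech), invented for illustration.
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
labels = [0, 0, 1, 1, 0, 1]
best_f1, best_threshold = best_f1_threshold(scores, labels)
print(best_f1, best_threshold)  # 0.8 0.8
```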
