Toxicity Detection is NOT all you Need: Measuring the Gaps to Supporting Volunteer Content Moderators

Yang Trista Cao, Lovely-Frances Domingo, Sarah Ann Gilbert, Michelle Mazurek, Katie Shilton, Hal Daumé

TL;DR

We observe a non-trivial gap between past research on automating aspects of content moderation and the needs of volunteer content moderators in identifying violations of various moderation rules.

Abstract

Extensive efforts in automated approaches for content moderation have focused on developing models to identify toxic, offensive, and hateful content, with the aim of lightening the load for moderators. Yet it remains uncertain whether improvements on those tasks have truly addressed moderators' needs in accomplishing their work. In this paper, we surface gaps between past research efforts that have aimed to automate aspects of content moderation and the needs of volunteer content moderators in identifying violations of various moderation rules. To do so, we conduct a model review on Hugging Face to reveal the availability of models that cover various moderation rules and guidelines from three exemplar forums. We further put state-of-the-art LLMs to the test, evaluating how well these models flag violations of platform rules from one particular forum. Finally, we conduct a user survey study with volunteer moderators to gain insight into their perspectives on useful moderation models. Overall, we observe a non-trivial gap: on a significant portion of the rules, developed models are missing and LLMs exhibit moderate to low performance. Moderators' reports provide guidance for future work on developing moderation assistant models.

Paper Structure (21 sections, 8 figures, 3 tables)

Figures (8)

  • Figure 1: Model review annotation results. Panels (a) and (b) show the results for the annotation questions "rule-based?" and "toxic?", respectively. Panels (c) and (d) show the results for the question "match?" for rules related to toxicity detection and rules not related to it, respectively.
  • Figure 2: Performance of GPT-4 (top) and Llama-2 (bottom) on detecting question posts that violate moderation rules. The x-axis lists the moderation rules for question posts. Each pair of bars shows the precision (left) and recall (right) for one rule. Bars marked grey are scores that moderators would not consider useful in a moderation assistant tool. Both LLMs perform well on the left-most rules, at least one LLM performs well on the rules in the middle, and neither LLM performs well enough on the right-most rules.
  • Figure 3: Performance of GPT-4 (top) and Llama-2 (bottom) on detecting comment posts that violate moderation rules. The x-axis lists the moderation rules for comment posts. Each pair of bars shows the precision (left) and recall (right) for one rule. Bars marked grey are scores that moderators would not consider useful in a moderation assistant tool. Both LLMs perform well on the left-most rules, at least one LLM performs well on the rules in the middle, and neither LLM performs well enough on the right-most rules.
  • Figure 4: Clustering results from the survey study. Each point is a moderation rule from r/AskHistorians. The x-axis shows the precision importance score, or how important it is for a model to have high precision for this rule. Similarly, the y-axis shows the recall importance score. Note that we removed rules for which at least three participants stated they would not use a model.
  • Figure 5: Performance of the GPT-4 model on detecting violations of moderation rules under the few-shot setting (top) and the zero-shot setting (bottom). The x-axis lists the moderation rules. Each pair of bars shows the precision (left) and recall (right) for one rule.
  • ...and 3 more figures
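
The captions for Figures 2, 3, and 5 report a precision and recall pair for each moderation rule. As a minimal sketch of how such per-rule scores could be computed, the Python below compares gold rule violations with model-flagged violations for each post. The function name `per_rule_precision_recall`, the rule names, and the example labels are all hypothetical illustrations and are not taken from the paper or its data.

```python
from collections import defaultdict


def per_rule_precision_recall(gold, predicted):
    """Return {rule: (precision, recall)} for multi-label rule violations.

    gold, predicted: lists of sets of rule names, one set per post.
    A rule with no flagged posts gets precision 0.0; a rule with no
    gold violations gets recall 0.0 (an assumed convention).
    """
    tp = defaultdict(int)  # rule flagged and actually violated
    fp = defaultdict(int)  # rule flagged but not violated
    fn = defaultdict(int)  # rule violated but not flagged

    for gold_rules, pred_rules in zip(gold, predicted):
        for rule in pred_rules & gold_rules:
            tp[rule] += 1
        for rule in pred_rules - gold_rules:
            fp[rule] += 1
        for rule in gold_rules - pred_rules:
            fn[rule] += 1

    scores = {}
    for rule in sorted(set(tp) | set(fp) | set(fn)):
        flagged = tp[rule] + fp[rule]
        violated = tp[rule] + fn[rule]
        precision = tp[rule] / flagged if flagged else 0.0
        recall = tp[rule] / violated if violated else 0.0
        scores[rule] = (precision, recall)
    return scores


# Hypothetical labels for four posts (not data from the paper).
gold = [{"no_sources"}, {"no_sources", "incivility"}, set(), {"incivility"}]
predicted = [{"no_sources"}, {"incivility"}, {"no_sources"}, set()]

scores = per_rule_precision_recall(gold, predicted)
for rule, (precision, recall) in scores.items():
    print(f"{rule}: precision={precision:.2f} recall={recall:.2f}")
```

Scoring each rule separately, rather than pooling all rules into one number, matches the per-rule bar pairs in the figures and makes it possible to compare each rule's scores against whatever precision and recall thresholds moderators consider useful for that rule.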