Lost in Moderation: How Commercial Content Moderation APIs Over- and Under-Moderate Group-Targeted Hate Speech and Linguistic Variations

David Hartmann, Amin Oueslati, Dimitri Staufer, Lena Pohlmann, Simon Munzert, Hendrik Heuer

TL;DR

This work tackles the risk that commercial content moderation APIs may over-moderate legitimate speech while under-moderating harmful content, especially for marginalized groups. It introduces a reproducible black-box audit framework and applies it to five major APIs across four datasets, analyzing over $5$ million queries. Key findings show heavy reliance on identity tokens leading to over-moderation of counter-speech and under-detection of implicit hate, with significant group- and dataset-dependent variation; OpenAI and Amazon are the most balanced, but all services exhibit notable biases. The paper contributes a scalable auditing methodology, a set of design and policy recommendations for transparency and threshold guidance, and a case for independent audits to improve accountability and trust in AI-driven content moderation.

Abstract

Commercial content moderation APIs are marketed as scalable solutions to combat online hate speech. However, the reliance on these APIs risks both silencing legitimate speech, called over-moderation, and failing to protect online platforms from harmful speech, known as under-moderation. To assess such risks, this paper introduces a framework for auditing black-box NLP systems. Using the framework, we systematically evaluate five widely used commercial content moderation APIs. Analyzing five million queries based on four datasets, we find that APIs frequently rely on group identity terms, such as "black", to predict hate speech. While OpenAI's and Amazon's services perform slightly better, all providers under-moderate implicit hate speech, which uses codified messages, especially against LGBTQIA+ individuals. Simultaneously, they over-moderate counter-speech, reclaimed slurs, and content related to Black, LGBTQIA+, Jewish, and Muslim people. We recommend that API providers offer better guidance on API implementation and threshold setting and more transparency on their APIs' limitations. Warning: This paper contains offensive and hateful terms and concepts. We have chosen to reproduce these terms for reasons of transparency.

Paper Structure

This paper contains 36 sections, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Our black-box audit framework to evaluate commercial content moderation APIs.
  • Figure 2: The pipeline of content moderation APIs, illustrated with a blog post as an example.
  • Figure 3: Perturbation Sensitivity Analysis on synthetic data from the Identity Phrase Templates of Dixon et al. (2018) and on non-synthetic data from HateXplain. Counterfactual Token Fairness (CTF) scores are computed as the difference in toxicity between the phrase containing the baseline dominant token and its marginalized perturbation. CTF scores are averaged per marginalized group and service and reported for non-toxic. In addition to the point estimate, the figure shows a 95% confidence interval assuming a Student's t-distribution.
  • Figure 4: FP examples: SHAP value visualizations for examples from the ToxiGen and HateXplain datasets using Amazon Comprehend and OpenAI. Red indicates a strong contribution to deciding hate speech; blue indicates a strong contribution to deciding non-hate speech.
  • Figure 5: FN examples: SHAP value visualizations for examples from the ToxiGen and HateXplain datasets. Red indicates a strong contribution to deciding hate speech; blue indicates a strong contribution to deciding non-hate speech. For visualization, we merged some adjacent tokens and averaged their contributions.
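
The computation described in the Figure 3 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the example scores, and the sign convention (baseline score minus perturbed score) are all assumptions, and the critical value `t_crit` is passed in by hand rather than derived (it could be obtained from a t-table or, for example, `scipy.stats.t.ppf(0.975, n - 1)`).

```python
import math
import statistics


def ctf_scores(baseline_toxicity, perturbed_toxicity):
    """Per-pair toxicity difference.

    Assumed sign convention: score of the phrase with the baseline
    (dominant) token minus score of its marginalized perturbation.
    """
    if len(baseline_toxicity) != len(perturbed_toxicity):
        raise ValueError("score lists must be paired one-to-one")
    return [b - p for b, p in zip(baseline_toxicity, perturbed_toxicity)]


def mean_with_ci(scores, t_crit):
    """Mean and two-sided confidence interval under a Student's t assumption.

    t_crit is the t critical value for len(scores) - 1 degrees of freedom.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    half_width = t_crit * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical toxicity scores for five phrase pairs (not data from the paper).
baseline = [0.10, 0.20, 0.15, 0.05, 0.10]
perturbed = [0.30, 0.35, 0.40, 0.20, 0.25]

scores = ctf_scores(baseline, perturbed)
# 2.776 is the two-sided 95% critical value for 4 degrees of freedom.
mean, ci_low, ci_high = mean_with_ci(scores, t_crit=2.776)
print(f"mean CTF = {mean:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Under this assumed convention, a negative mean indicates that the marginalized perturbation received a higher toxicity score than the baseline phrase. In an actual audit the scores would come from the moderation API under test, grouped per service and per marginalized group before averaging.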