Watching the Watchers: A Comparative Fairness Audit of Cloud-based Content Moderation Services

David Hartmann, Amin Oueslati, Dimitri Staufer

TL;DR

This paper addresses the lack of external accountability in cloud-based content moderation with a black-box audit of four major moderation APIs across four hate-speech datasets. It combines threshold-based and threshold-invariant metrics with perturbation sensitivity analysis for counterfactual fairness, and uses a Bi-LSTM-derived identity classifier on MegaSpeech to assess group fairness across seven vulnerable groups. All services struggle to detect implicit hate speech, performance varies across services, and biases against LGBTQ+ people and PoC persist, while biases against women appear largely addressed. The work highlights the limits of relying solely on cloud services for moderation and underscores the need for robust fairness safeguards and human oversight in deployment and policy design.

Abstract

Online platforms face the challenge of moderating an ever-increasing volume of content, including harmful hate speech. Given the absence of clear legal definitions and the lack of transparency about the role algorithms play in content moderation decisions, there is a critical need for external accountability. Our study contributes to filling this gap by systematically evaluating four leading cloud-based content moderation services through a third-party audit, highlighting issues such as biases against minorities and vulnerable groups that may arise through over-reliance on these services. Using a black-box audit approach and four benchmark data sets, we measure performance in explicit and implicit hate speech detection, assess counterfactual fairness through perturbation sensitivity analysis, and report disparities in performance for certain target identity groups and data sets. Our analysis reveals that all services had difficulties detecting implicit hate speech, which relies on more subtle and coded messages. Moreover, our results point to the need to remove group-specific bias. Biases against some groups, such as women, appear to have been mostly rectified, while biases against other groups, such as LGBTQ+ and PoC, remain.
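
To make the idea of perturbation sensitivity analysis concrete, the following is a minimal sketch and not the authors' implementation. It fills each template with different identity terms, scores every variant, and reports per term the mean absolute score difference to the other terms. The `{identity}` slot, the function names, and the `toy_score` stand-in for a moderation API are all hypothetical and chosen for illustration only.

```python
from itertools import permutations
from statistics import mean
from typing import Callable, Dict, List


def counterfactual_gaps(
    templates: List[str],
    identity_terms: List[str],
    score_fn: Callable[[str], float],
) -> Dict[str, float]:
    """Mean absolute score gap between each identity term and all other terms.

    Each template must contain an "{identity}" slot (an assumption of this sketch).
    A gap of 0 means swapping the term never changes the score.
    """
    if len(identity_terms) < 2:
        raise ValueError("need at least two identity terms to compare")
    gaps: Dict[str, List[float]] = {term: [] for term in identity_terms}
    for template in templates:
        # Score every counterfactual variant of this template once.
        scores = {t: score_fn(template.format(identity=t)) for t in identity_terms}
        # Compare each term against every other term in the same template.
        for a, b in permutations(identity_terms, 2):
            gaps[a].append(abs(scores[a] - scores[b]))
    return {term: mean(values) for term, values in gaps.items()}


def toy_score(text: str) -> float:
    """Hypothetical, deliberately biased scorer standing in for a real API."""
    return 0.9 if "gay" in text else 0.1


if __name__ == "__main__":
    templates = ["I am a {identity} person.", "My friend is {identity}."]
    terms = ["gay", "straight", "tall"]
    for term, gap in counterfactual_gaps(templates, terms, toy_score).items():
        print(f"{term}: {gap:.2f}")
```

In a real audit, `score_fn` would wrap a call to the moderation service under test, and the gaps would be computed separately for toxic and non-toxic templates.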

Paper Structure (3 sections, 1 figure, 1 table)

This paper contains 3 sections, 1 figure, 1 table.

Figures (1)

  • Figure 1: Left: pinned ROC AUC by moderation service, dataset, and minority group. ToxiGen includes 4,268 observations, HateXplain 1,748, Jigsaw 19,228, and MegaSpeech 33,886. Right: CFT scores, computed through PSA on synthetic data from the Identity Phrase Templates of Dixon et al. (2018) and on non-synthetic data from MegaSpeech, averaged per group and service, and reported separately for non-toxic and toxic examples. In addition to the point estimate, the figure shows a 95% confidence interval assuming a Student's t-distribution.
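
As a rough illustration of the pinned ROC AUC named in the caption, the sketch below follows the construction as we understand it from Dixon et al. (2018): the AUC is computed on the subgroup's examples combined with an equally sized random sample from the full dataset. Whether the audit samples in exactly this way is an assumption, and the function names, the `in_group` flags, and the toy data are hypothetical.

```python
import random
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Pairwise comparison is quadratic; acceptable for a sketch, slow on large data.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pinned_auc(
    labels: Sequence[int],
    scores: Sequence[float],
    in_group: Sequence[bool],
    seed: int = 0,
) -> float:
    """AUC on the subgroup plus an equally sized random sample of the full data."""
    rng = random.Random(seed)
    group_idx: List[int] = [i for i, flag in enumerate(in_group) if flag]
    # The background sample is drawn from the full dataset, so it may
    # include subgroup members again (an assumption of this sketch).
    background_idx = rng.sample(range(len(labels)), k=len(group_idx))
    idx = group_idx + background_idx
    return roc_auc([labels[i] for i in idx], [scores[i] for i in idx])


if __name__ == "__main__":
    labels = [1, 0, 1, 0, 1, 0, 1, 0]          # 1 = toxic, 0 = non-toxic
    scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4]
    in_group = [True, True, True, True, False, False, False, False]
    print(pinned_auc(labels, scores, in_group))
```

A pinned AUC well below the overall AUC suggests that the service's scores for that subgroup are shifted relative to the rest of the data, which is the kind of disparity the figure reports per group and service.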