Algorithmic Arbitrariness in Content Moderation

Juan Felipe Gomez, Caio Vieira Machado, Lucas Monteiro Paes, Flavio P. Calmon

TL;DR

The paper investigates predictive multiplicity in algorithmic content moderation, showing that competing toxicity detectors can disagree on the same text, producing arbitrary moderation outcomes. It models the set of near-equally-accurate models as a Rashomon set $\mathcal{R}(\epsilon, h_{\text{ref}})$ and quantifies arbitrariness $\widehat{\mathbb{A}}(\epsilon)$ and pairwise disagreement $\widehat{\text{PD}}_{\epsilon}(\mathbf{x})$ across multiple datasets, observing substantial arbitrariness (e.g., around $34\%$ for some state-of-the-art detectors) and non-negligible disagreement. The work further reveals that arbitrariness is not uniformly distributed across content topics or demographic groups, evidencing disparate impacts and challenging non-discrimination and procedural justice principles under the ICCPR; humans and machines also disagree in varying degrees, including in cases humans deem clear. Through these findings, the paper argues for greater transparency, accountability, and governance around algorithmic content moderation to avoid an “algorithmic leviathan” that disproportionately governs speech rights and to inform policy debates in regimes like the DSA and Online Safety Acts. It concludes with a path forward emphasizing rule-based, explainable moderation and safeguards to mitigate harms while accommodating scalable automated moderation.

Abstract

Machine learning (ML) is widely used to moderate online content. Despite its scalability relative to human moderation, the use of ML introduces unique challenges to content moderation. One such challenge is predictive multiplicity: multiple competing models for content classification may perform equally well on average, yet assign conflicting predictions to the same content. This multiplicity can result from seemingly innocuous choices during model development, such as random seed selection for parameter initialization. We experimentally demonstrate how content moderation tools can arbitrarily classify samples as toxic, leading to arbitrary restrictions on speech. We discuss these findings in terms of human rights set out by the International Covenant on Civil and Political Rights (ICCPR), namely freedom of expression, non-discrimination, and procedural justice. We analyze (i) the extent of predictive multiplicity among state-of-the-art LLMs used for detecting toxic content; (ii) the disparate impact of this arbitrariness across social groups; and (iii) how model multiplicity compares to unambiguous human classifications. Our findings indicate that up-scaled algorithmic moderation risks legitimizing an algorithmic leviathan, where an algorithm disproportionately manages human rights. To mitigate such risks, our study underscores the need to identify and increase the transparency of arbitrariness in content moderation applications. Since algorithmic content moderation is being fueled by pressing social concerns, such as disinformation and hate speech, our discussion of harms raises concerns relevant to policy debates. Our findings also contribute to content moderation and intermediary liability laws being discussed and passed in many countries, such as the Digital Services Act in the European Union, the Online Safety Act in the United Kingdom, and the Fake News Bill in Brazil.

Paper Structure (49 sections, 5 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Average pairwise disagreement and arbitrariness in different target groups for the fine-tuned ToxiGen and Jigsaw models. The results show the pairwise disagreement in percentage (x-axis) for the union of four different datasets: DynaHate, SBF, ToxiGen, and HateExplain. The results are shown for training and test partitions of each dataset. The confidence in the CP methods was chosen to be $95\%$.
  • Figure 2: Average pairwise disagreement and arbitrariness for Clear and Unclear sentences using the ToxiGen fine-tuned and Jigsaw fine-tuned models. The figure shows the estimated pairwise disagreement values along with the $95\%$ confidence intervals computed from the standard error of the mean. We consider a sentence Unclear when at least one annotator labeled the sentence differently than the others, and Clear otherwise. The confidence in the CP methods was chosen to be $95\%$.
  • Figure 3: Screenshot of the HuggingFace platform's most popular toxicity detection models as of the writing of this paper.
  • Figure 4: Training trajectories for the fine-tuned ToxiGen and Jigsaw models over 10 randomly chosen seeds.
  • Figure 5: Average pairwise disagreement and arbitrariness in different target groups for the fine-tuned ToxiGen and Jigsaw models. The results show the pairwise disagreement in percentage (x-axis) for the union of four different datasets: DynaHate, SBF, ToxiGen, and HateExplain. The results are shown for training and test partitions of each dataset. The confidence in the CP methods was chosen to be $50\%$ containing all fine-tuned models, leading to the selection of 38 out of 40 RoBERTa models fine-tuned on the ToxiGen dataset and 17 out of 20 Jigsaw fine-tuned models for the Rashomon set.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Definition 1: Arbitrariness
  • Definition 2: Pairwise Disagreement (black2022model; d2022underspecification)
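
The two definitions listed above are not reproduced on this page, so the following is only a minimal sketch of how such quantities are commonly estimated, not the authors' implementation. It assumes that the Rashomon set keeps every model whose accuracy is within `epsilon` of a reference model, that arbitrariness is the fraction of samples receiving conflicting labels from models in that set, and that pairwise disagreement for a sample is the fraction of model pairs that label it differently. The function names and the toy predictions are hypothetical.

```python
from itertools import combinations


def rashomon_indices(accuracies, ref_index, epsilon):
    """Indices of models whose accuracy is within epsilon of the reference model.

    Assumption: membership is decided by accuracy on a shared evaluation set.
    """
    threshold = accuracies[ref_index] - epsilon
    return [i for i, acc in enumerate(accuracies) if acc >= threshold]


def arbitrariness(preds):
    """Fraction of samples on which at least two models give different labels.

    preds is a list of per-model prediction lists, all of the same length.
    """
    n_samples = len(preds[0])
    conflicting = sum(
        1 for j in range(n_samples) if len({row[j] for row in preds}) > 1
    )
    return conflicting / n_samples


def pairwise_disagreement(preds):
    """Per-sample fraction of model pairs that assign different labels."""
    pairs = list(combinations(range(len(preds)), 2))
    n_samples = len(preds[0])
    return [
        sum(preds[a][j] != preds[b][j] for a, b in pairs) / len(pairs)
        for j in range(n_samples)
    ]


# Toy data (hypothetical): 4 models, 5 texts, label 1 = toxic, 0 = not toxic.
all_preds = [
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0],
]
accuracies = [0.90, 0.89, 0.88, 0.70]

# Keep only models within 0.03 accuracy of model 0 (the reference).
members = rashomon_indices(accuracies, ref_index=0, epsilon=0.03)
rashomon_preds = [all_preds[i] for i in members]

arb = arbitrariness(rashomon_preds)
pd_per_sample = pairwise_disagreement(rashomon_preds)
print(members, arb, pd_per_sample)
```

In this toy example the low-accuracy fourth model is excluded, and two of the five texts receive conflicting labels from the remaining three models. Restricting the comparison to near-equally-accurate models is what separates predictive multiplicity from ordinary model error.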