Algorithmic Arbitrariness in Content Moderation
Juan Felipe Gomez, Caio Vieira Machado, Lucas Monteiro Paes, Flavio P. Calmon
TL;DR
The paper investigates predictive multiplicity in algorithmic content moderation, showing that competing toxicity detectors can disagree on the same text, producing arbitrary moderation outcomes. It models the set of near-equally-accurate models as a Rashomon set $\mathcal{R}(\epsilon, h_{ref})$ and quantifies arbitrariness $\widehat{\mathbb{A}}(\epsilon)$ and pairwise disagreement $\widehat{\text{PD}}_{\epsilon}(\mathbf{x})$ across multiple datasets, observing substantial arbitrariness (e.g., around $34\%$ for some state-of-the-art detectors) and non-negligible disagreement. The work further reveals that arbitrariness is not uniformly distributed across content topics or demographic groups, evidencing disparate impacts and challenging non-discrimination and procedural justice principles under ICCPR; humans and machines also disagree in varying degrees, including in cases humans deem clear. Through these findings, the paper argues for greater transparency, accountability, and governance around algorithmic content moderation to avoid an “algorithmic leviathan” that disproportionately governs speech rights and to inform policy debates in regimes like the DSA and Online Safety Acts. It concludes with a path forward emphasizing rule-based, explainable moderation and safeguards to mitigate harms while accommodating scalable automated moderation.
Abstract
Machine learning (ML) is widely used to moderate online content. Despite its scalability relative to human moderation, the use of ML introduces unique challenges to content moderation. One such challenge is predictive multiplicity: multiple competing models for content classification may perform equally well on average, yet assign conflicting predictions to the same content. This multiplicity can result from seemingly innocuous choices during model development, such as random seed selection for parameter initialization. We experimentally demonstrate how content moderation tools can arbitrarily classify samples as toxic, leading to arbitrary restrictions on speech. We discuss these findings in terms of human rights set out by the International Covenant on Civil and Political Rights (ICCPR), namely freedom of expression, non-discrimination, and procedural justice. We analyze (i) the extent of predictive multiplicity among state-of-the-art LLMs used for detecting toxic content; (ii) the disparate impact of this arbitrariness across social groups; and (iii) how model multiplicity compares to unambiguous human classifications. Our findings indicate that scaled-up algorithmic moderation risks legitimizing an algorithmic leviathan, where an algorithm disproportionately manages human rights. To mitigate such risks, our study underscores the need to identify and increase the transparency of arbitrariness in content moderation applications. Since algorithmic content moderation is being fueled by pressing social concerns, such as disinformation and hate speech, our discussion on harms raises concerns relevant to policy debates. Our findings also contribute to content moderation and intermediary liability laws being discussed and passed in many countries, such as the Digital Services Act in the European Union, the Online Safety Act in the United Kingdom, and the Fake News Bill in Brazil.
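To make predictive multiplicity concrete, the following is a minimal sketch on hand-made toy predictions, not the authors' experimental code. It rests on several assumptions that this summary does not spell out: the Rashomon set is read as the models whose accuracy is at most epsilon below a reference model's, arbitrariness is read as the fraction of samples that receive conflicting predictions within that set, and pairwise disagreement for a sample is read as the fraction of model pairs in the set that disagree on it. All function and variable names are hypothetical.

```python
from itertools import combinations


def accuracy(preds, labels):
    """Fraction of samples on which a model's predictions match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def rashomon_set(all_preds, labels, ref_index, epsilon):
    """Assumed definition: keep models within epsilon accuracy of the reference."""
    ref_acc = accuracy(all_preds[ref_index], labels)
    return [p for p in all_preds if accuracy(p, labels) >= ref_acc - epsilon]


def arbitrariness(models):
    """Assumed definition: share of samples with conflicting predictions."""
    n = len(models[0])
    conflicted = sum(len({m[i] for m in models}) > 1 for i in range(n))
    return conflicted / n


def pairwise_disagreement(models, i):
    """Assumed definition: share of model pairs that disagree on sample i."""
    pairs = list(combinations(models, 2))
    if not pairs:
        return 0.0
    return sum(a[i] != b[i] for a, b in pairs) / len(pairs)


# Toy data: 1 = toxic, 0 = non-toxic (illustrative values only).
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
all_preds = [
    [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # reference model, wrong on sample 0
    [1, 1, 1, 0, 1, 0, 1, 0, 1, 0],  # equally accurate, wrong on sample 1
    [0, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # equally accurate, wrong on sample 0
    [0, 1, 0, 1, 0, 0, 1, 0, 1, 0],  # much less accurate, excluded below
]

models = rashomon_set(all_preds, labels, ref_index=0, epsilon=0.05)
a_hat = arbitrariness(models)
pd_sample_0 = pairwise_disagreement(models, 0)
pd_sample_2 = pairwise_disagreement(models, 2)

print(len(models), a_hat, pd_sample_0, pd_sample_2)
```

In this toy setup the three retained models all reach 90% accuracy, yet two of the ten samples receive conflicting labels depending on which model is deployed. That is the sense in which a moderation decision can hinge on an arbitrary development choice rather than on the content itself.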
