Are you sure? Measuring models bias in content moderation through uncertainty
Alessandra Urbinati, Mirko Lai, Simona Frenda, Marco Antonio Stranisci
TL;DR
This work targets biases in automatic content moderation by measuring model uncertainty through conformal prediction. It introduces Uncertainty Divergence and Demographic Divergence to compare how predictions align with annotators from different socio-demographic groups, using the Brier score as a conformity metric and Conformity Delta to capture per-annotator disagreement. Across 11 models and two hate-speech corpora, the study finds that some models achieve high accuracy on minority-labeled content but exhibit higher uncertainty for these groups, indicating biases not visible through $F_1$ alone. The findings suggest uncertainty-driven representations can guide debiasing and more inclusive model deployment, while acknowledging limitations such as binary demographics and corpus-specific effects and proposing uncertainty-aware training and transfer to other perspectivist tasks in future work.
Abstract
Automatic content moderation is crucial to ensuring safety on social media. Language-model-based classifiers are increasingly adopted for this task, but they have been shown to perpetuate racial and social biases. Although several resources and benchmark corpora have been developed to address this problem, measuring the fairness of content moderation models remains an open issue. In this work, we present an unsupervised approach that benchmarks models on the basis of their uncertainty in classifying messages annotated by people belonging to vulnerable groups. We use uncertainty, computed by means of conformal prediction, as a proxy to analyze the bias of 11 models against women and non-white annotators, and we observe to what extent it diverges from performance-based metrics such as the $F_1$ score. The results show that some pre-trained models predict the labels coming from minority groups with high accuracy, even though their confidence in these predictions is low. Therefore, by measuring the confidence of models, we can see which groups of annotators are better represented in pre-trained models and guide the debiasing of these models before they are actually used.
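To make the idea of uncertainty obtained through conformal prediction more concrete, the following is a minimal sketch of generic split conformal prediction with a Brier-style nonconformity score. It is an illustration under stated assumptions, not the procedure used in this work: the function names, the exact form of the score, the miscoverage level `alpha`, and the toy probabilities are all hypothetical, and the divergence measures named in the TL;DR are not reproduced here. The sketch only shows how the size of a prediction set can serve as an uncertainty signal that is separate from accuracy.

```python
import numpy as np


def brier_nonconformity(probs):
    """Return an (n, k) array whose entry [i, c] is the Brier-style score
    sum_j (probs[i, j] - onehot(c)[j])**2, i.e. how poorly label c fits row i.
    (Assumed score form, chosen for illustration.)"""
    probs = np.asarray(probs, dtype=float)
    onehot = np.eye(probs.shape[1])
    # Broadcast (n, 1, k) against (1, k, k), then sum over the class axis.
    return ((probs[:, None, :] - onehot[None, :, :]) ** 2).sum(axis=2)


def calibrate(cal_probs, cal_labels, alpha=0.1):
    """Split-conformal threshold: the ceil((n + 1) * (1 - alpha))-th smallest
    score of the true labels on a held-out calibration set."""
    cal_labels = np.asarray(cal_labels)
    all_scores = brier_nonconformity(cal_probs)
    scores = all_scores[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    if rank > n:
        # Too few calibration points for this alpha: every label is kept.
        return np.inf
    return np.sort(scores)[rank - 1]


def prediction_sets(probs, threshold):
    """Boolean (n, k) mask: label c is in the set if its score <= threshold."""
    return brier_nonconformity(probs) <= threshold


def mean_set_size(probs, threshold):
    """Average number of labels per prediction set (larger = more uncertain)."""
    return prediction_sets(probs, threshold).sum(axis=1).mean()


# Toy calibration data: predicted probabilities for [not hateful, hateful].
cal_probs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]]
cal_labels = [0, 1, 0, 1]
threshold = calibrate(cal_probs, cal_labels, alpha=0.25)

# Hypothetical test messages: one confident prediction, one borderline.
test_probs = [[0.95, 0.05], [0.5, 0.5]]
print(prediction_sets(test_probs, threshold))
print(mean_set_size(test_probs, threshold))
```

Under these assumptions, computing `mean_set_size` separately on messages labeled by different annotator groups would give a rough per-group uncertainty figure that can be compared against per-group $F_1$.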
