Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study
Ayan Majumdar, Feihao Chen, Jinghui Li, Xiaozhen Wang
TL;DR
This work addresses the regulatory and technical need to audit demographic-targeted social biases in large-scale text by introducing a multi-axis, multi-label evaluation framework for LLMs. It builds a nine-axis taxonomy aligned with anti-discrimination principles, adapts twelve diverse datasets into a unified benchmark, and evaluates prompting, in-context learning, and fine-tuning across a range of model sizes and architectures. Key findings show that fine-tuned encoder models can achieve strong bias-detection performance with better parity across demographics, while prompting-based methods benefit from few-shot in-context learning and larger models but exhibit notable disparities, especially in multi-axis cases. The study provides practical guidance for scalable bias auditing and highlights remaining gaps in intersectional bias detection, underscoring the need for more nuanced, multilingual, and ethically guided auditing frameworks.
Abstract
Large-scale web-scraped text corpora used to train general-purpose AI models often contain harmful demographic-targeted social biases, creating a regulatory need for data auditing and for scalable bias-detection methods. Although prior work has investigated biases in text datasets and related detection methods, these studies remain narrow in scope. They typically focus on a single content type (e.g., hate speech), cover limited demographic axes, overlook biases affecting multiple demographics simultaneously, and analyze a limited set of techniques. Consequently, practitioners lack a holistic understanding of the strengths and limitations of recent large language models (LLMs) for automated bias detection. In this study, we present a comprehensive evaluation framework for English texts that assesses the ability of LLMs to detect demographic-targeted social biases. To align with regulatory requirements, we frame bias detection as a multi-label task using a demographic-focused taxonomy. We then conduct a systematic evaluation of models across scales and techniques, including prompting, in-context learning, and fine-tuning. Using twelve datasets spanning diverse content types and demographics, our study demonstrates the promise of fine-tuned smaller models for scalable detection. However, our analyses also expose persistent gaps across demographic axes and for biases targeting multiple demographics, underscoring the need for more effective and scalable auditing frameworks.
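To make the multi-label framing concrete, the following is a minimal sketch of how per-axis detection quality and cross-axis disparity might be scored. It is an illustration under stated assumptions, not the evaluation code used in this study: the axis names, the toy labels, the function names, and the max-minus-min "parity gap" are all hypothetical choices made for this example.

```python
from typing import Dict, List, Set

# Hypothetical axis names for illustration only; the study's full
# nine-axis taxonomy is not reproduced here.
AXES = ["gender", "race", "religion"]


def per_axis_f1(
    gold: List[Set[str]], pred: List[Set[str]], axes: List[str]
) -> Dict[str, float]:
    """Compute binary F1 for each axis, treating each axis as its own label.

    Each text carries a set of targeted axes (possibly empty, possibly
    several), which is what makes the task multi-label.
    """
    scores: Dict[str, float] = {}
    for axis in axes:
        tp = sum(1 for g, p in zip(gold, pred) if axis in g and axis in p)
        fp = sum(1 for g, p in zip(gold, pred) if axis not in g and axis in p)
        fn = sum(1 for g, p in zip(gold, pred) if axis in g and axis not in p)
        denom = 2 * tp + fp + fn
        # Convention assumed here: F1 is 0.0 when the axis never appears.
        scores[axis] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-axis F1 scores."""
    return sum(scores.values()) / len(scores)


def parity_gap(scores: Dict[str, float]) -> float:
    """Spread between the best- and worst-detected axis (assumed metric)."""
    return max(scores.values()) - min(scores.values())


# Toy data: four texts with gold and predicted sets of targeted axes.
gold = [{"gender"}, {"race", "religion"}, set(), {"gender", "race"}]
pred = [{"gender"}, {"race"}, {"religion"}, {"gender", "race"}]

scores = per_axis_f1(gold, pred, AXES)
print(scores)
print("macro-F1:", macro_f1(scores))
print("parity gap:", parity_gap(scores))
```

In this toy example the second text targets two axes at once, and the prediction recovers only one of them, which is the kind of multi-demographic miss that a single-label formulation would hide.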
