Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings
Carolin M. Schuster, Maria-Alexandra Dinisor, Shashwat Ghatiwala, Georg Groh
TL;DR
The paper addresses the challenge of communicating and mitigating biases in large language models by profiling gender-related bias using interpretable stereotype dimensions. It maps contextual embeddings into a bias space grounded in the stereotype content model (warmth and competence) and extends this with seven granular dimensions via a SensePolar-inspired projection, applying it to twelve open-source LLMs across multiple context types. The findings show consistent gender-name biases (female names linked to warmth, male names to competence) across models and layers, with context and term granularity influencing the results; nonbinary terms show weaker effects. This work provides a theory-driven, visualizable bias profiling framework that supports bias auditing and mitigation efforts, while acknowledging limitations in scope (gender, English, binary classifications) and calling for multilingual and domain-specific extensions.
Abstract
Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI); however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts, these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions, we investigate gender bias in contextual embeddings across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuitiveness and their usefulness for exposing and visualizing bias.
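To make the idea of a stereotype profile concrete, the following is a minimal sketch of scoring one embedding along interpretable dimensions such as warmth and competence. It is a simplification under stated assumptions, not the paper's implementation: each dimension is reduced to a single axis given by the difference between the mean embeddings of its two poles, the function names (mean_vector, dimension_axis, stereotype_profile) are hypothetical, and the two-dimensional vectors are toy values standing in for contextual embeddings taken from an LLM layer. A SensePolar-style projection would instead change basis over all dimensions jointly.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def dimension_axis(high_pole, low_pole):
    """Unit vector pointing from the low pole to the high pole of one dimension.

    Assumption: each pole is a list of embeddings of dictionary terms
    (for example, 'warm' terms versus 'cold' terms).
    """
    high = mean_vector(high_pole)
    low = mean_vector(low_pole)
    axis = [h - l for h, l in zip(high, low)]
    norm = math.sqrt(sum(a * a for a in axis))
    if norm == 0:
        raise ValueError("pole centroids coincide; the axis is undefined")
    return [a / norm for a in axis]

def stereotype_profile(embedding, dimensions):
    """Score an embedding on every dimension by projecting it onto that axis."""
    profile = {}
    for name, (high_pole, low_pole) in dimensions.items():
        axis = dimension_axis(high_pole, low_pole)
        profile[name] = sum(e * a for e, a in zip(embedding, axis))
    return profile

# Toy 2-d data (hypothetical values, not taken from any model).
dimensions = {
    "warmth": ([[1.0, 0.0]], [[-1.0, 0.0]]),
    "competence": ([[0.0, 1.0]], [[0.0, -1.0]]),
}
name_embedding = [0.6, -0.2]  # stand-in for the contextual embedding of a first name
profile = stereotype_profile(name_embedding, dimensions)
print(profile)
```

In this toy setup the name scores positively on warmth and negatively on competence. Comparing such profiles across groups of names, contexts, and layers is the kind of analysis the abstract describes.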
