Profiling Bias in LLMs: Stereotype Dimensions in Contextual Word Embeddings

Carolin M. Schuster, Maria-Alexandra Dinisor, Shashwat Ghatiwala, Georg Groh

TL;DR

The paper addresses the challenge of communicating and mitigating biases in large language models by profiling gender-related bias using interpretable stereotype dimensions. It maps contextual embeddings into a bias space grounded in the stereotype content model (warmth and competence) and extends this with seven granular dimensions via a SensePolar-inspired projection, applying it to twelve open-source LLMs across multiple context types. The findings show consistent gender-name biases (female names linked to warmth, male names to competence) across models and layers, with context and term granularity influencing the results; nonbinary terms show weaker effects. This work provides a theory-driven, visualizable bias profiling framework that supports bias auditing and mitigation efforts, while acknowledging limitations in scope (gender, English, binary classifications) and calling for multilingual and domain-specific extensions.

Abstract

Large language models (LLMs) are the foundation of the current successes of artificial intelligence (AI); however, they are unavoidably biased. To effectively communicate the risks and encourage mitigation efforts, these models need adequate and intuitive descriptions of their discriminatory properties, appropriate for all audiences of AI. We suggest bias profiles with respect to stereotype dimensions based on dictionaries from social psychology research. Along these dimensions, we investigate gender bias in contextual embeddings across contexts and layers, and generate stereotype profiles for twelve different LLMs, demonstrating their intuitiveness and their use case for exposing and visualizing bias.
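
The projection into stereotype dimensions mentioned in the TL;DR can be illustrated with a minimal sketch. It assumes that each dimension's direction is the difference between the centroids of its high-pole and low-pole dictionary embeddings, and that embeddings are mapped into the resulting space with a pseudo-inverse base change, as in POLAR-style approaches; this summary does not confirm that this is the authors' exact procedure. The vectors are synthetic stand-ins for real contextual embeddings, and all names (`polar_directions`, `project`, `toy_terms`, `group_a`, `group_b`) are hypothetical.

```python
import numpy as np


def polar_directions(high_poles, low_poles):
    """One direction per dimension: centroid(high pole) - centroid(low pole)."""
    return np.stack([
        high.mean(axis=0) - low.mean(axis=0)
        for high, low in zip(high_poles, low_poles)
    ])


def project(embeddings, directions):
    """Coordinates of each embedding in the space spanned by the directions.

    Solves coords @ directions ~= embedding in the least-squares sense
    via the Moore-Penrose pseudo-inverse.
    """
    return embeddings @ np.linalg.pinv(directions)


rng = np.random.default_rng(0)
dim = 16  # toy embedding size; real LLM hidden states are much larger

# Assumption: synthetic axes stand in for warmth and competence.
warmth_axis = np.eye(dim)[0]
competence_axis = np.eye(dim)[1]


def toy_terms(center, n=20):
    """Noisy synthetic 'embeddings' scattered around a center vector."""
    return center + 0.05 * rng.normal(size=(n, dim))


# Synthetic dictionary embeddings for the high and low pole of each dimension.
high_poles = [toy_terms(warmth_axis), toy_terms(competence_axis)]
low_poles = [toy_terms(-warmth_axis), toy_terms(-competence_axis)]
directions = polar_directions(high_poles, low_poles)  # shape (2, dim)

# Two synthetic groups of target terms, each shifted along one axis.
group_a = toy_terms(0.5 * warmth_axis, n=100)
group_b = toy_terms(0.5 * competence_axis, n=100)

# A group's profile is the mean coordinate on each stereotype dimension.
profile_a = project(group_a, directions).mean(axis=0)
profile_b = project(group_b, directions).mean(axis=0)
print("group A (warmth, competence):", profile_a)
print("group B (warmth, competence):", profile_b)
```

With real data, the synthetic vectors would be replaced by contextual embeddings of dictionary terms and of names or gendered terms taken from a chosen model layer, and the difference between group profiles would be checked with a significance test before interpretation.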

Paper Structure

This paper contains 17 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: 7D stereotype profile for Llama-3-8B, revealing differences in embeddings of 100 female and 100 male-associated names. *Statistically significant differences (p<0.05).
  • Figure 2: Properties of context examples: Histograms of example counts, numbers of words and positions of dictionary terms within the examples.
  • Figure 3: 2D stereotype profiles for 100 female/male-associated names (left) and 9 female/male gendered terms (right). LW/HW = Low/High Warmth. LC/HC = Low/High Competence. *Dimensions with statistically significant differences (p<0.05).
  • Figure 4: 2D stereotype profile for Llama-3-8B (see Figure 3) with additional projections of individual nonbinary terms. LW/HW = Low/High Warmth. LC/HC = Low/High Competence.
  • Figure 5: 7D stereotype profiles for 100 female/male-associated names (left) and 9 female/male gendered terms (right) for Llama-3.2-3B-instruct. *Statistically significant differences (p<0.05).
  • ...and 1 more figure