MIST: Towards Multi-dimensional Implicit BiaS Evaluation of LLMs via Theory of Mind

Yanlin Li, Hao Liu, Huimin Liu, Kun Wang, Yinwei Wei, Yupeng Hu

TL;DR

This work reframes bias in large language models as a Theory of Mind (ToM) failure by adapting the Stereotype Content Model (SCM) to a multidimensional analysis along Competence, Sociability, and Morality. It introduces two indirect tasks, the Word Association Bias Test (WABT) and the Affective Attribution Test (AAT), to elicit latent stereotypes while avoiding explicit bias queries. Through evaluations on eight state-of-the-art LLMs, the study uncovers robust sociability biases, multidimensional divergence, and asymmetries in affective attributions, highlighting complex structural biases that direct queries may miss. The framework enables finer-grained bias diagnostics with potential implications for debiasing strategies and improving ToM-like reasoning in LLMs across diverse social contexts.

Abstract

Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To this end, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT) to assess implicit lexical associations and the Affective Attribution Test (AAT) to measure covert affective leanings, both designed to probe latent stereotypes without triggering model avoidance. Extensive experiments on 8 State-of-the-Art LLMs demonstrate our framework's capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.

Paper Structure

This paper contains 27 sections, 3 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Example illustrating how stereotypes from training corpora are internalized by LLMs, manifest as group-level stereotypes, propagate to individual bias, and ultimately lead to failures in ToM.
  • Figure 2: The pipeline of the evaluation methodology.
  • Figure 3: The radar charts illustrate the average bias scores of 8 LLMs across 3 stereotype dimensions. Each axis represents a specific social group, and the radial values indicate the direction and magnitude of the model's bias toward that group.
  • Figure 4: Visualization of the rankings of the 8 evaluated LLMs across the dimensions of Competence, Sociability, and Morality in the WABT task. The shifting ranks highlight that a model's bias tendencies are inconsistent across different stereotype dimensions.
  • Figure 5: Comparison of emotional framing tendencies among 5 LLMs on the AAT. The bars represent the percentage of each model's responses categorized as Comedy, Tragedy, or Neutrality.
  • ...and 4 more figures