MIST: Towards Multi-dimensional Implicit BiaS Evaluation of LLMs via Theory of Mind
Yanlin Li, Hao Liu, Huimin Liu, Kun Wang, Yinwei Wei, Yupeng Hu
TL;DR
This work reframes bias in large language models (LLMs) as a Theory of Mind (ToM) failure, adapting the Stereotype Content Model (SCM) to a multi-dimensional analysis along Competence, Sociability, and Morality. It introduces two indirect tasks, the Word Association Bias Test (WABT) and the Affective Attribution Test (AAT), to elicit latent stereotypes while avoiding explicit bias queries. Evaluations on eight state-of-the-art LLMs uncover robust sociability biases, multi-dimensional divergence, and asymmetries in affective attributions, highlighting complex structural biases that direct queries may miss. The framework enables finer-grained bias diagnostics, with potential implications for debiasing strategies and for improving ToM-like reasoning in LLMs across diverse social contexts.
Abstract
Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To address this, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT), which assesses implicit lexical associations, and the Affective Attribution Test (AAT), which measures covert affective leanings. Both are designed to probe latent stereotypes without triggering model avoidance. Extensive experiments on eight state-of-the-art LLMs demonstrate our framework's capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.
