Leveraging Implicit Sentiments: Enhancing Reliability and Validity in Psychological Trait Evaluation of LLMs

Huanhuan Ma, Haisong Gong, Xiaoyuan Yi, Xing Xie, Dongkuan Xu

TL;DR

This work tackles reliability and validity gaps in psychometric evaluations of large language models by introducing the Core Sentiment Inventory (CSI), an implicit, bilingual instrument inspired by the Implicit Association Test. CSI uses a 5,000-item neutral English-Chinese stimulus set to quantify model sentiment across optimism, pessimism, and neutrality, reducing reliance on self-reports and human-centered scales. Empirical results show CSI yields higher reliability than traditional measures, with up to 45% greater consistency and near-zero reluctance, and strong predictive validity demonstrated by a downstream story generation task (correlation > 0.85 with real-world outputs). The approach reveals nuanced cross-lingual sentiment patterns across mainstream and open models, and its public release enables broader evaluation and refinement of emotional alignment in AI systems.

Abstract

Recent advancements in Large Language Models (LLMs) have led to their increasing integration into human life. With the transition from mere tools to human-like assistants, understanding their psychological aspects, such as emotional tendencies and personalities, becomes essential for ensuring their trustworthiness. However, current psychological evaluations of LLMs, often based on human psychological assessments like the BFI, face significant limitations. The results from these approaches often lack reliability and have limited validity when predicting LLM behavior in real-world scenarios. In this work, we introduce a novel evaluation instrument specifically designed for LLMs, called the Core Sentiment Inventory (CSI). CSI is a bilingual tool, covering both English and Chinese, that implicitly evaluates models' sentiment tendencies, providing an insightful psychological portrait of LLMs across three dimensions: optimism, pessimism, and neutrality. Through extensive experiments, we demonstrate that: 1) CSI effectively captures nuanced emotional patterns, revealing significant variation in LLMs across languages and contexts; 2) compared to current approaches, CSI significantly improves reliability, yielding more consistent results; and 3) the correlation between CSI scores and the sentiment of LLMs' real-world outputs exceeds 0.85, demonstrating its strong validity in predicting LLM behavior. We make CSI publicly available at: https://github.com/dependentsign/CSI.

Paper Structure

This paper contains 35 sections, 5 figures, 18 tables.

Figures (5)

  • Figure 1: Reliability issues in current psychometric evaluation methods for LLMs.
  • Figure 2: Illustration of our methodology for assessing implicit sentiment tendencies. The process begins with sampling words from CSI as stimuli. The model's responses are then used to compute a numerical CSI Score across optimism, pessimism, and neutrality. Finally, each type of stimulus is provided for qualitative analysis.
  • Figure 3: Prompt template used to perform the IAT.
  • Figure 4: Correlation between Pessimism Scores in Generated Stories and CSI Scores Across Different Models and Languages.
  • Figure 5: Inconsistency in BFI scores across different GPT models and prompt settings.
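
The validity claim (correlation above 0.85, as in Figure 4) compares each model's CSI score with the sentiment measured in its generated stories. The sketch below shows one way such a correlation could be computed. It assumes Pearson's r, which this section does not confirm; the function name `pearson_r` and all score values are hypothetical illustrations, not results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: one entry per (model, language) pair.
csi_pessimism = [0.10, 0.20, 0.30, 0.40]    # assumed CSI pessimism scores
story_pessimism = [0.12, 0.18, 0.33, 0.41]  # assumed pessimism in generated stories

r = pearson_r(csi_pessimism, story_pessimism)
print(f"Pearson r = {r:.3f}")
```

A rank-based coefficient such as Spearman's would be a reasonable alternative if only the ordering of models matters.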