Large language models show fragile cognitive reasoning about human emotions

Sree Bhattacharyya, Evgenii Kuriabov, Lucas Craig, Tharun Dilliraj, Reginald B. Adams, Jia Li, James Z. Wang

Abstract

Affective computing seeks to support the holistic development of artificial intelligence by enabling machines to engage with human emotion. Recent foundation models, particularly large language models (LLMs), have been trained and evaluated on emotion-related tasks, typically using supervised learning with discrete emotion labels. Such evaluations largely focus on surface phenomena, such as recognizing expressed or evoked emotions, leaving open whether these systems reason about emotion in cognitively meaningful ways. Here we ask whether LLMs can reason about emotions through underlying cognitive dimensions rather than labels alone. Drawing on cognitive appraisal theory, we introduce CoRE, a large-scale benchmark designed to probe the implicit cognitive structures LLMs use when interpreting emotionally charged situations. We assess alignment with human appraisal patterns, internal consistency, cross-model generalization, and robustness to contextual variation. We find that LLMs capture systematic relations between cognitive appraisals and emotions but show misalignment with human judgments and instability across contexts.

Paper Structure

This paper contains 24 sections, 4 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Implicitly and explicitly important dimensions of cognitive appraisal. a, Implicit importance of cognitive appraisal dimensions, as demonstrated through the top 6 principal components. b, Explicitly reported top-6 appraisal dimensions across different models.
  • Figure 1: Varimax loadings for the top-6 principal components. a, b, c, d, e, f, The loadings are shown for DeepSeek R1 (a), GPT-o4-mini (b), Gemini 2.5 Flash (c), QwQ 32B (d), Phi 4 14B (e), LLaMA 3 8B (f).
  • Figure 2: Wasserstein distances between emotions. a, b, c, d, e, f, Distances between appraisal distributions of different emotions, as obtained from DeepSeek R1 (a), GPT-o4 (b), Gemini 2.5 Flash (c), QwQ 32B (d), Phi 4 14B (e), LLaMA 3 8B (f).
  • Figure 2: All feature coefficients as found from the analysis with logistic regression, across all models and emotion classes.
  • Figure 3: Cross-model comparison of appraisal distributions. a, Average pairwise correlation of ratings provided by models, for each appraisal dimension, with the standard deviation. b, p-values of the Maximum Mean Discrepancy (MMD) test across distributions from different models, for the same emotion. The range of values depicted is [0, 1]. Darker cells indicate values closer to 1, and lighter cells indicate values closer to 0. A recorded p-value $< 0.05$ denotes statistical significance for the hypothesis that the compared distributions are different from each other.
  • ...and 5 more figures
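
To make the p-values in Figure 3b concrete, the following is a minimal sketch of a two-sample MMD permutation test. It is not the authors' implementation. The Gaussian (RBF) kernel, the bandwidth `gamma`, the number of permutations, and all function names are assumptions chosen for illustration, and the ratings in the usage lines are synthetic.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length rating vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=0.5):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, gamma) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def mmd_permutation_test(xs, ys, n_permutations=200, gamma=0.5, seed=0):
    """Return (observed MMD^2, permutation p-value).

    Null hypothesis: xs and ys are drawn from the same distribution.
    """
    rng = random.Random(seed)
    observed = mmd_squared(xs, ys, gamma)
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(n_permutations):
        # Reassign the pooled points to two groups at random.
        rng.shuffle(pooled)
        if mmd_squared(pooled[:n], pooled[n:], gamma) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_permutations + 1)


# Synthetic two-dimensional "appraisal ratings" for two hypothetical models.
model_a = [(0.1 * i, 0.0) for i in range(10)]
model_b = [(5.0 + 0.1 * i, 5.0) for i in range(10)]
stat, p_value = mmd_permutation_test(model_a, model_b)
print(f"MMD^2 = {stat:.3f}, p = {p_value:.4f}")
```

Under this sketch, a p-value below 0.05 would be read as in the caption: the two models' appraisal distributions for that emotion differ significantly. The kernel bandwidth strongly affects the test's sensitivity, so any real analysis would need to state how it was chosen.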