Detecting Emotional Dynamic Trajectories: An Evaluation Framework for Emotional Support in Language Models

Zhouxing Tan, Ruochong Xiong, Yulong Wan, Jinlong Ma, Hanlin Xue, Qichun Deng, Haifeng Jing, Zhengtong Zhang, Depei Liu, Shiyuan Luo, Junfei Liu

TL;DR

This paper tackles the challenge of evaluating emotional support in language models over long-term, dynamic interactions. It proposes a trajectory-based framework that models user emotion with a first-order Markov process and calibrates estimates via causal interventions, introducing BEL, ETV, and ECP as core metrics. A large-scale benchmark (328 contexts, 1,152 disturbances) with emotion-regulation constraints and perturbations enables robust cross-model comparisons, revealing language- and strategy-dependent disparities and validating the approach against human judgments. The work offers practical insights for designing and evaluating emotionally intelligent language systems, with open data and reproducible procedures to advance long-term ESC research.

Abstract

Emotional support is a core capability in human-AI interaction, with applications including psychological counseling, role play, and companionship. However, existing evaluations of large language models (LLMs) often rely on short, static dialogues and fail to capture the dynamic and long-term nature of emotional support. To overcome this limitation, we shift from snapshot-based evaluation to trajectory-based assessment, adopting a user-centered perspective that evaluates models based on their ability to improve and stabilize user emotional states over time. Our framework constructs a large-scale benchmark consisting of 328 emotional contexts and 1,152 disturbance events, simulating realistic emotional shifts under evolving dialogue scenarios. To encourage psychologically grounded responses, we constrain model outputs using validated emotion regulation strategies such as situation selection and cognitive reappraisal. User emotional trajectories are modeled as a first-order Markov process, and we apply causally-adjusted emotion estimation to obtain unbiased emotional state tracking. Based on this framework, we introduce three trajectory-level metrics: Baseline Emotional Level (BEL), Emotional Trajectory Volatility (ETV), and Emotional Centroid Position (ECP). These metrics collectively capture user emotional dynamics over time and support comprehensive evaluation of long-term emotional support performance of LLMs. Extensive evaluations across a diverse set of LLMs reveal significant disparities in emotional support capabilities and provide actionable insights for model development.
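To make the first-order Markov modeling mentioned above more concrete, the following is a minimal sketch of how a transition matrix could be estimated from an observed sequence of discrete emotion states, and how the expected emotional position could then be propagated forward from an initial distribution. It is an illustration under stated assumptions, not the paper's method: the three-state labeling, the valence scores, the smoothing constant, and the function names `estimate_transition_matrix` and `expected_position` are all hypothetical, and the sketch does not reproduce the paper's definitions of BEL, ETV, or ECP or its causal adjustment step.

```python
import numpy as np


def estimate_transition_matrix(states, n_states, smoothing=1e-6):
    """Estimate a row-stochastic first-order transition matrix from a state sequence.

    `smoothing` is a small additive constant (an assumption of this sketch)
    that keeps rows for unvisited states well defined.
    """
    counts = np.full((n_states, n_states), smoothing, dtype=float)
    # Count each observed transition (previous state -> current state).
    for prev, curr in zip(states[:-1], states[1:]):
        counts[prev, curr] += 1.0
    # Normalize each row so that it sums to 1.
    return counts / counts.sum(axis=1, keepdims=True)


def expected_position(initial_dist, M, valence, steps):
    """Return the expected valence at each of `steps` time points.

    The state distribution starts at `initial_dist` and is advanced by one
    application of the transition matrix `M` per step.
    """
    dist = np.asarray(initial_dist, dtype=float)
    positions = []
    for _ in range(steps):
        positions.append(float(dist @ valence))
        dist = dist @ M
    return positions


# Hypothetical encoding: 0 = negative, 1 = neutral, 2 = positive.
trajectory = [0, 0, 1, 1, 2, 1, 2, 2]
valence = np.array([-1.0, 0.0, 1.0])

M = estimate_transition_matrix(trajectory, n_states=3)
positions = expected_position([1.0, 0.0, 0.0], M, valence, steps=3)
print(M.round(3))
print(positions)
```

The first-order assumption appears in the counting loop: each transition depends only on the immediately preceding state, so the whole trajectory is summarized by a single matrix.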

Paper Structure

This paper contains 51 sections, 40 equations, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: Overview of our evaluation framework for emotional support in long-term dialogues. It includes three modules: dynamic user-agent interaction under emotional events, causal emotion estimation based on Markov modeling, and three trajectory-level metrics including Baseline Emotional Level (BEL), Emotional Trajectory Volatility (ETV), and Emotional Centroid Position (ECP).
  • Figure 2: Causal graph illustrating emotional evolution with an unobserved confounder. Right: theoretical intervention $\mathrm{do}(E_{t-1})$ removes spurious correlation via backdoor adjustment. Variable definitions: $Q$ (User Dialogue History), $A$ (Model Dialogue History), $S$ (Emotion State), $I$ (Internal Thought), $U$ (Unobserved Confounder).
  • Figure 3: Visualization of the sentiment centroid, defined as the expected emotional position under the empirical Markov model formed by the initial distribution and transition matrix $M$.
  • Figure 4: Sentiment dynamics across multi-turn dialogues under different emotional interference conditions.
  • Figure 5: Distribution of user emotional distress scenarios.
  • ...and 7 more figures