Feel the Difference? A Comparative Analysis of Emotional Arcs in Real and LLM-Generated CBT Sessions

Xiaoyi Wang, Jiwei Zhang, Guangtao Zhang, Honglei Guo

TL;DR

This study critically evaluates whether synthetic, LLM-generated CBT dialogues reproduce the nuanced emotional dynamics observed in real therapy. By introducing RealCBT and adapting the Utterance Emotion Dynamics framework, it benchmarks real versus synthetic sessions using a detailed, lexicon-based analysis of valence, arousal, and dominance. Key findings show real sessions exhibit greater emotional variability and authentic reactivity, while synthetic data display higher mean affect and less dynamic arc structure, with especially weak alignment for client trajectories. The work provides an empirical benchmark and publicly releases RealCBT to guide future development of emotionally faithful, clinically credible dialogue systems for mental health.

Abstract

Synthetic therapy dialogues generated by large language models (LLMs) are increasingly used in mental health NLP to simulate counseling scenarios, train models, and supplement limited real-world data. However, it remains unclear whether these synthetic conversations capture the nuanced emotional dynamics of real therapy. In this work, we introduce RealCBT, a dataset of authentic cognitive behavioral therapy (CBT) dialogues, and conduct the first comparative analysis of emotional arcs between real and LLM-generated CBT sessions. We adapt the Utterance Emotion Dynamics framework to analyze fine-grained affective trajectories across valence, arousal, and dominance dimensions. Our analysis spans both full dialogues and individual speaker roles (counselor and client), using real sessions from the RealCBT dataset and synthetic dialogues from the CACTUS dataset. We find that while synthetic dialogues are fluent and structurally coherent, they diverge from real conversations in key emotional properties: real sessions exhibit greater emotional variability, more emotion-laden language, and more authentic patterns of reactivity and regulation. Moreover, emotional arc similarity remains low across all pairings, with especially weak alignment between real and synthetic speakers. These findings underscore the limitations of current LLM-generated therapy data and highlight the importance of emotional fidelity in mental health applications. To support future research, our dataset RealCBT is released at https://gitlab.com/xiaoyi.wang/realcbt-dataset.
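As a rough illustration of what a lexicon-based emotion arc looks like, the sketch below scores the words of one speaker's utterances against a valence-arousal-dominance lexicon and smooths them with a rolling window. It is a minimal sketch under stated assumptions, not the authors' pipeline: the lexicon `TOY_VAD` and its scores are made up for the example, and the names `emotion_arc` and `summarize`, the window size, and the tokenizer are all hypothetical choices.

```python
import re
from statistics import mean, pstdev

# Hypothetical toy lexicon (invented scores): word -> (valence, arousal, dominance).
TOY_VAD = {
    "hopeless": (0.05, 0.55, 0.15),
    "worried": (0.20, 0.75, 0.25),
    "tired": (0.25, 0.20, 0.30),
    "better": (0.85, 0.45, 0.70),
    "calm": (0.80, 0.15, 0.65),
    "proud": (0.90, 0.60, 0.85),
}

def emotion_arc(utterances, dim=0, window=3):
    """Rolling-window mean of one VAD dimension over the lexicon words.

    dim: 0 = valence, 1 = arousal, 2 = dominance.
    Words missing from the lexicon are skipped.
    """
    scores = [
        TOY_VAD[word][dim]
        for utterance in utterances
        for word in re.findall(r"[a-z']+", utterance.lower())
        if word in TOY_VAD
    ]
    if len(scores) < window:
        return []  # too few emotion words to form a single window
    return [mean(scores[i:i + window]) for i in range(len(scores) - window + 1)]

def summarize(arc):
    """Mean level and variability (population standard deviation) of an arc."""
    return {"mean": mean(arc), "variability": pstdev(arc)}

client_turns = [
    "I feel hopeless and worried.",
    "I am tired all the time.",
    "Today I feel better, even calm.",
    "I am proud of that.",
]
valence_arc = emotion_arc(client_turns, dim=0)
stats = summarize(valence_arc)
```

Computing such an arc separately for counselor and client turns gives per-role trajectories, whose mean and variability can then be compared between real and synthetic sessions.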

Paper Structure

This paper contains 31 sections, 2 figures, 15 tables.

Figures (2)

  • Figure 1: Boxplots showing the distributions of the mean and variability for each of the three affective dimensions across three comparisons: Real vs. Synthetic Dialogues, Real vs. Synthetic Counselors, and Real vs. Synthetic Clients.
  • Figure 2: Emotion arcs of valence, arousal, and dominance for a client in three representative cases: (a) highest correlation, (b) near-zero correlation, and (c) lowest (negative) correlation between real and synthetic trajectories.
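
The correlation between a real and a synthetic trajectory, as referenced in Figure 2, can be illustrated with a small sketch. It assumes that arcs of different lengths are first linearly resampled to a common number of points and then compared with Pearson correlation; that alignment step, the point count `n`, and the names `resample`, `pearson`, and `arc_similarity` are assumptions made for this example and may differ from the paper's procedure.

```python
from math import isnan, sqrt

def resample(arc, n=20):
    """Linearly interpolate an arc onto n evenly spaced points (n >= 2)."""
    if len(arc) == 1:
        return [arc[0]] * n
    out = []
    for k in range(n):
        pos = k * (len(arc) - 1) / (n - 1)  # fractional index into arc
        lo = int(pos)
        hi = min(lo + 1, len(arc) - 1)
        frac = pos - lo
        out.append(arc[lo] * (1 - frac) + arc[hi] * frac)
    return out

def pearson(x, y):
    """Pearson correlation; returns NaN if either series is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)

def arc_similarity(arc_a, arc_b, n=20):
    """Correlation of two arcs after resampling both to n points."""
    return pearson(resample(arc_a, n), resample(arc_b, n))

# Illustrative values only, not data from either dataset.
rising = [0.1, 0.5, 0.9]
also_rising = [0.2, 0.4, 0.6, 0.8]
falling = [0.8, 0.6, 0.4, 0.2]

same_direction = arc_similarity(rising, also_rising)
opposite_direction = arc_similarity(rising, falling)
```

Under this sketch, two arcs that rise together score close to 1, arcs moving in opposite directions score close to -1, and values near zero indicate no linear relationship, which mirrors the three cases shown in the figure.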