Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations

Yuxuan Zhang, Yulong Li, Zichen Yu, Feilong Tang, Zhixiang Lu, Chong Li, Kang Dang, Jionglong Su

TL;DR

CauseMotion tackles emotional causality in long-form dialogues by combining Retrieval-Augmented Generation (RAG) with multimodal fusion of textual and audio features. It introduces a sliding-window dialogue knowledge base, an audio-augmented multimodal embedding $E_m = \text{Concat}(E_t, E_e, E_r)$ with $d_m = d_t + d_e + 1$, and a causal reasoning framework that weights relations using $\alpha,\beta,\gamma$ with $\alpha+\beta+\gamma=1$. The approach is evaluated on the ATLAS-6 and DiaASQ datasets and achieves state-of-the-art results, including a causal chain accuracy of $0.574$ on ATLAS-6 (vs. $0.528$ for GPT-4o) and the leading Span Match and Pair Extraction scores on DiaASQ, which demonstrates the benefit of long-range, multimodal reasoning for emotion-cause inference. The work provides a new benchmark with 20,000 synthetic long dialogues and a validation set of 2,745 real dialogues, and shows that multimodal context integration substantially improves the depth of emotional understanding and the causal inference of LLMs. Overall, CauseMotion sets a new standard for long-sequence emotional causality analysis and points to broader applications in multimodal affective dialogue systems.
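The dimension bookkeeping and the weight constraint stated above can be checked with a small sketch. It assumes, going only by $d_m = d_t + d_e + 1$, that $E_r$ is a single scalar (plausibly the speech rate) and that $E_e$ is a $d_e$-dimensional audio emotion vector. The function names, the toy values, and the idea of combining three generic scores are illustrative assumptions, not details taken from the paper.

```python
from __future__ import annotations

from typing import Sequence


def fuse_embeddings(e_t: Sequence[float], e_e: Sequence[float], e_r: float) -> list[float]:
    """Concatenate a text embedding, an audio emotion embedding, and one scalar.

    Assumption: E_r is a single number, which is what d_m = d_t + d_e + 1 implies.
    """
    return [*e_t, *e_e, e_r]


def weighted_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Convex combination of three scores with weights (alpha, beta, gamma).

    What each score measures is hypothetical here; only the constraint
    alpha + beta + gamma = 1 comes from the summary.
    """
    if len(scores) != 3 or len(weights) != 3:
        raise ValueError("expected exactly three scores and three weights")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * s for w, s in zip(weights, scores))


# Toy example: d_t = 4, d_e = 2, so d_m should be 4 + 2 + 1 = 7.
e_m = fuse_embeddings([0.1, 0.2, 0.3, 0.4], [0.5, 0.6], 1.2)
combined = weighted_score([1.0, 0.0, 0.5], [0.5, 0.25, 0.25])
```

Because the weights sum to 1, the combined score stays within the range of the individual scores, which keeps relation weights comparable across dialogues.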

Abstract

Long-sequence causal reasoning seeks to uncover causal relationships within extended time series data but is hindered by complex dependencies and the challenges of validating causal links. To address the limitations of large-scale language models (e.g., GPT-4) in capturing intricate emotional causality within extended dialogues, we propose CauseMotion, a long-sequence emotional causal reasoning framework grounded in Retrieval-Augmented Generation (RAG) and multimodal fusion. Unlike conventional methods that rely only on textual information, CauseMotion enriches semantic representations by incorporating audio-derived features (vocal emotion, emotional intensity, and speech rate) into the textual modality. By integrating RAG with a sliding window mechanism, it retrieves and leverages contextually relevant dialogue segments, enabling the inference of complex emotional causal chains that span multiple conversational turns. To evaluate its effectiveness, we constructed the first benchmark dataset dedicated to long-sequence emotional causal reasoning, featuring dialogues with over 70 turns. Experimental results demonstrate that the proposed RAG-based multimodal integrated approach substantially enhances both the depth of emotional understanding and the causal inference capabilities of large-scale language models. GLM-4 integrated with CauseMotion achieves an 8.7% improvement in causal accuracy over the original model and surpasses GPT-4o by 1.2%. Additionally, on the publicly available DiaASQ dataset, CauseMotion-GLM-4 achieves state-of-the-art results in accuracy, F1 score, and causal reasoning accuracy.
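The sliding-window retrieval idea can be illustrated with a minimal sketch. The window size, the stride, and the word-overlap (Jaccard) scoring are all assumptions made for illustration: the abstract does not specify them, and a real RAG pipeline would normally score windows with dense embeddings rather than word overlap.

```python
from __future__ import annotations


def sliding_windows(turns: list[str], size: int, stride: int) -> list[list[str]]:
    """Split a dialogue into overlapping windows of `size` turns."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if len(turns) <= size:
        return [list(turns)] if turns else []
    windows = [turns[s:s + size] for s in range(0, len(turns) - size + 1, stride)]
    # Add a final window if the stride left trailing turns uncovered.
    if (len(turns) - size) % stride != 0:
        windows.append(turns[-size:])
    return windows


def retrieve(query: str, windows: list[list[str]], top_k: int = 1) -> list[list[str]]:
    """Rank windows by Jaccard word overlap with the query (a stand-in scorer)."""
    q = set(query.lower().split())

    def overlap(window: list[str]) -> float:
        w = set(" ".join(window).lower().split())
        union = q | w
        return len(q & w) / len(union) if union else 0.0

    return sorted(windows, key=overlap, reverse=True)[:top_k]


# Hypothetical five-turn dialogue, windowed two turns at a time.
turns = [
    "A: I failed the exam",
    "B: that is sad",
    "A: thanks for listening",
    "B: want some lunch",
    "A: sure pizza sounds good",
]
windows = sliding_windows(turns, size=2, stride=2)
best = retrieve("why is A sad about the exam", windows)[0]
```

Overlapping windows keep a cause and its later emotional effect together more often than disjoint chunks would, which is the motivation for windowing a dialogue of 70 or more turns before retrieval.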


Paper Structure

This paper contains 10 sections, 14 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Emotional fluctuation patterns and interpersonal influences during key conversational events among participants. The figure highlights how emotions evolve through various dialogue phases and the effects of interactions such as emotional contagion, support, conflict escalation, and mediation on these emotional dynamics.
  • Figure 2: Interactional patterns among interlocutors and the corresponding emotional dynamics of a specific participant within the dialogue. The figure maps various interaction types—including supportive exchanges, confrontational dialogues, and neutral statements—to the participant's emotional states such as happiness, frustration, anger, and calmness.
  • Figure 3: An overview of our multimodal emotion analysis framework. The framework first extracts audio and dialogue features using the SoundVoice Feature Encoder for six-tuple emotion analysis, encompassing Holder, Target, Aspect, Opinion, Sentiment, and Rationale. The Recognition Module integrates these features, followed by an LLM-based causal reasoning component to analyze emotional relationships. The output is visualized as a social interaction graph, illustrating the emotional dynamics and interactions between participants.
  • Figure 4: The average performance is evaluated on 2,745 real-world samples from the ATLAS-6 dataset.