Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations
Yuxuan Zhang, Yulong Li, Zichen Yu, Feilong Tang, Zhixiang Lu, Chong Li, Kang Dang, Jionglong Su
TL;DR
CauseMotion tackles emotional causality in long-form dialogues by integrating Retrieval-Augmented Generation (RAG) with multimodal fusion of textual and audio features. It introduces a sliding-window dialogue knowledge base, an audio-augmented multimodal embedding $E_m = \text{Concat}(E_t, E_e, E_r)$ with $d_m = d_t + d_e + 1$, and a causal reasoning framework that weights relations using $\alpha,\beta,\gamma$ with $\alpha+\beta+\gamma=1$. The approach is evaluated on the ATLAS-6 and DiaASQ datasets, where it achieves state-of-the-art results, including a causal chain accuracy of $0.574$ on ATLAS (vs. $0.528$ for GPT-4o) and the leading Span Match and Pair Extraction scores on DiaASQ, demonstrating the benefits of long-range, multimodal reasoning for emotion-cause inference. The work provides a new benchmark with 20,000 synthetic long dialogues and a validation set of 2,745 real dialogues, and shows that multimodal context integration substantially improves the depth of emotional understanding and causal inference in LLMs. Overall, CauseMotion sets a new standard for long-sequence emotional causality analysis and points to broader applications in multimodal affective dialogue systems.
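To make the dimension bookkeeping in $E_m = \text{Concat}(E_t, E_e, E_r)$ concrete, the following is a minimal sketch, not the authors' implementation. It assumes $E_t$ is a text embedding of size $d_t$, $E_e$ an emotion embedding of size $d_e$, and $E_r$ a single scalar audio feature (which is what the $+1$ in $d_m$ suggests). The function names and the three component scores passed to `weighted_relation_score` are hypothetical placeholders; the summary only states that the weights satisfy $\alpha+\beta+\gamma=1$.

```python
from typing import List


def fuse_embedding(e_t: List[float], e_e: List[float], e_r: float) -> List[float]:
    """Concatenate a text embedding, an emotion embedding, and one scalar feature.

    The result has length d_t + d_e + 1, matching d_m in the summary.
    Treating E_r as a single scalar is an assumption made for illustration.
    """
    return list(e_t) + list(e_e) + [float(e_r)]


def weighted_relation_score(
    s_a: float, s_b: float, s_c: float,
    alpha: float, beta: float, gamma: float,
) -> float:
    """Combine three hypothetical relation scores with weights that sum to 1."""
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    return alpha * s_a + beta * s_b + gamma * s_c


if __name__ == "__main__":
    e_m = fuse_embedding([0.1, 0.2, 0.3], [0.5, 0.6], 1.5)
    print(len(e_m))  # d_t + d_e + 1 = 3 + 2 + 1
    print(weighted_relation_score(1.0, 0.0, 0.5, alpha=0.5, beta=0.3, gamma=0.2))
```

Constraining the weights to sum to 1 keeps the combined score on the same scale as its components, so scores from different dialogues remain comparable.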
Abstract
Long-sequence causal reasoning seeks to uncover causal relationships within extended time series data but is hindered by complex dependencies and the challenges of validating causal links. To address the limitations of large-scale language models (e.g., GPT-4) in capturing intricate emotional causality within extended dialogues, we propose CauseMotion, a long-sequence emotional causal reasoning framework grounded in Retrieval-Augmented Generation (RAG) and multimodal fusion. Unlike conventional methods relying only on textual information, CauseMotion enriches semantic representations by incorporating audio-derived features (vocal emotion, emotional intensity, and speech rate) into textual modalities. By integrating RAG with a sliding window mechanism, it effectively retrieves and leverages contextually relevant dialogue segments, thus enabling the inference of complex emotional causal chains spanning multiple conversational turns. To evaluate its effectiveness, we constructed the first benchmark dataset dedicated to long-sequence emotional causal reasoning, featuring dialogues with over 70 turns. Experimental results demonstrate that the proposed RAG-based multimodal integrated approach substantially enhances both the depth of emotional understanding and the causal inference capabilities of large-scale language models. GLM-4 integrated with CauseMotion achieves an 8.7% improvement in causal accuracy over the original model and surpasses GPT-4o by 1.2%. Additionally, on the publicly available DiaASQ dataset, CauseMotion-GLM-4 achieves state-of-the-art results in accuracy, F1 score, and causal reasoning accuracy.
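The sliding-window retrieval idea can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's pipeline: the window size, the stride, and the word-overlap scoring are all hypothetical stand-ins (a real system would more plausibly rank windows by embedding similarity), and the function names are invented for this example.

```python
from typing import List


def sliding_windows(turns: List[str], size: int, stride: int) -> List[List[str]]:
    """Split a dialogue into overlapping windows of consecutive turns."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    if not turns:
        return []
    windows = []
    start = 0
    while True:
        windows.append(turns[start:start + size])
        if start + size >= len(turns):  # last window reaches the final turn
            break
        start += stride
    return windows


def retrieve(windows: List[List[str]], query: str, k: int = 2) -> List[List[str]]:
    """Return the k windows sharing the most words with the query.

    Word overlap is a simple stand-in for the similarity measure a RAG
    system would use; ties keep the original dialogue order.
    """
    q = set(query.lower().split())

    def overlap(window: List[str]) -> int:
        return len(q & set(" ".join(window).lower().split()))

    return sorted(windows, key=overlap, reverse=True)[:k]


if __name__ == "__main__":
    dialogue = [
        "A: I lost my job today",
        "B: that is terrible",
        "A: I feel so anxious",
        "B: try to rest",
        "A: thanks for listening",
    ]
    wins = sliding_windows(dialogue, size=2, stride=1)
    for w in retrieve(wins, "why anxious", k=1):
        print(w)
```

Overlapping windows keep adjacent turns together, so an emotion and a cause a few turns apart can land in the same retrieved segment rather than being split at a chunk boundary.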
