MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations
Gia-Bao Dinh Ho, Chang Wei Tan, Zahra Zamanzadeh Darban, Mahsa Salehi, Gholamreza Haffari, Wray Buntine
TL;DR
This work defines turning points (TPs) in casual conversations and introduces the Multi-modal Turning Point (MTP) dataset, a high-consensus, timestamped corpus built from The Big Bang Theory episodes. It formalizes three tasks, MTPC (classification), MTPD (detection), and MTPR (reasoning), and presents TPMaven, a two-component framework that grounds TP events via a scene describer (LLaVA) and a reasoner (LLMs) using tailored prompts. The dataset captures rich visual–textual cues, including utterance-level videos, transcripts, and evidence for changes in decisions, behaviors, perspectives, and feelings, with consensus-driven annotations and a circumplex-based feelings taxonomy. Experimental results show that TPMaven, particularly with GPT-4 and few-shot prompts, achieves strong TP classification (F1 ≈ 0.88) and reasonable TP detection (F1 ≈ 0.61), demonstrating the viability of automated grounding and explanation of conversational turning points. The work releases code and data to foster future research in multimodal grounding, turning-point reasoning, and applications in the analysis of social interactions and negotiation.
Abstract
Detecting critical moments, such as emotional outbursts or changes in decisions during conversations, is crucial for understanding shifts in human behavior and their consequences. Our work introduces a novel problem setting focusing on these moments as turning points (TPs), accompanied by a meticulously curated, high-consensus, human-annotated multi-modal dataset. We provide precise timestamps, descriptions, and visual-textual evidence highlighting changes in emotions, behaviors, perspectives, and decisions at these turning points. We also propose a framework, TPMaven, utilizing state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points in our multi-modal dataset. Evaluation results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with additional explanations aligning with human expectations.
