MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations

Gia-Bao Dinh Ho; Chang Wei Tan; Zahra Zamanzadeh Darban; Mahsa Salehi; Gholamreza Haffari; Wray Buntine

TL;DR

This work defines turning points (TPs) in casual conversations and introduces the Multi-modal Turning Point (MTP) dataset, a high-consensus, timestamped corpus built from The Big Bang Theory episodes. It formalizes three tasks, MTPC (classification), MTPD (detection), and MTPR (reasoning), and presents TPMaven, a two-component framework that grounds TP events via a scene describer (LLaVA) and a reasoner (LLMs) using tailored prompts. The dataset captures rich visual–textual cues, including utterance-level videos, transcripts, and evidence for changes in decisions, behaviors, perspectives, and feelings, with consensus-driven annotations and a circumplex-based feelings taxonomy. Experimental results show TPMaven, particularly GPT-4 with few-shot prompts, achieving strong TP classification (F1 ≈ 0.88) and reasonable TP detection (F1 ≈ 0.61), demonstrating the viability of automated grounding and explanation of conversational turning points. The work releases code and data to foster future research in multimodal grounding, turning-point reasoning, and applications in the analysis of social interactions and negotiation.

Abstract

Detecting critical moments, such as emotional outbursts or changes in decisions during conversations, is crucial for understanding shifts in human behavior and their consequences. Our work introduces a novel problem setting focusing on these moments as turning points (TPs), accompanied by a meticulously curated, high-consensus, human-annotated multi-modal dataset. We provide precise timestamps, descriptions, and visual-textual evidence highlighting changes in emotions, behaviors, perspectives, and decisions at these turning points. We also propose a framework, TPMaven, utilizing state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points in our multi-modal dataset. Evaluation results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with additional explanations aligning with human expectations.
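For readers unfamiliar with the metric reported above, the following is a minimal sketch of how an F1-score is computed from prediction counts. The function name `f1_score` and the counts in the usage line are illustrative assumptions; they are not taken from the paper, and the sketch does not reproduce how the paper matches predicted turning points to annotated ones in the detection task.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1-score, the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives  # everything the model flagged
    actual = true_positives + false_negatives     # everything annotators flagged
    if predicted == 0 or actual == 0:
        return 0.0  # precision or recall is undefined; report 0 by convention
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts for illustration only (not results from the paper).
print(round(f1_score(8, 2, 4), 2))  # 0.73
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are high, which is why it is a common choice when positive cases such as turning points are comparatively rare.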

Paper Structure (42 sections, 6 figures, 5 tables)

Introduction
Related work
Problem formulation
The MTP Dataset
Scene boundary annotation
Creating utterance-level videos
Multi-modal Turning Point Annotation
Turning Point Evidence Annotation
Feelings Annotation
Annotation consensus
TPMaven framework
Experiments
Conclusion
MTP Dataset creation details
Preprocessing
...and 27 more sections

Figures (6)

Figure 1: Consider this example: everyone is chatting casually. A turning point occurs when Penny (female character) starts crying, caused by her mentioning her ex while sharing her personal stories with Leonard and Sheldon (two male characters). According to human commonsense, this should be considered a significant change in the conversation because it catches the attention of both the people watching and the speakers involved (Leonard and Sheldon become confused).
Figure 2: The circumplex model of emotions (Russell, 1980)
Figure 3: Distribution of the 20 most frequent emotions before and after the turning point in our dataset.
Figure 4: Tracking results using GPT-3.5
Figure 5: Tracking results using GPT-3.5-turbo
...and 1 more figures

Theorems & Definitions (1)

Definition 1
