
MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations

Gia-Bao Dinh Ho, Chang Wei Tan, Zahra Zamanzadeh Darban, Mahsa Salehi, Gholamreza Haffari, Wray Buntine

TL;DR

This work defines turning points (TPs) in casual conversations and introduces the Multi-modal Turning Point (MTP) dataset, a high-consensus, timestamped corpus built from The Big Bang Theory episodes. It formalizes three tasks: MTPC (classification), MTPD (detection), and MTPR (reasoning). It also presents TPMaven, a two-component framework that grounds TP events via a scene describer (LLaVA) and a reasoner (LLMs) using tailored prompts. The dataset captures rich visual–textual cues, including utterance-level videos, transcripts, and evidence for changes in decisions, behaviors, perspectives, and feelings, with consensus-driven annotations and a circumplex-based feelings taxonomy. Experimental results show that TPMaven, particularly GPT-4 with few-shot prompts, achieves strong TP classification (F1 ≈ 0.88) and reasonable TP detection (F1 ≈ 0.61), demonstrating the viability of automated grounding and explanation of conversational turning points. The work releases code and data to foster future research in multimodal grounding, turning-point reasoning, and applications in the analysis of social interactions and negotiation.
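The two-component design described above (a scene describer feeding a reasoner) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Utterance` fields, the function names, the prompt wording, and the stub models are hypothetical and are not taken from the TPMaven code release. The stubs stand in for a vision-language model and an LLM so the example runs without external services, and the two dialogue lines are invented placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Utterance:
    speaker: str
    text: str
    video_path: str  # hypothetical field: path to the utterance-level clip


def build_narrative(utterances: List[Utterance],
                    describe_scene: Callable[[str], str]) -> str:
    """Stage 1: pair each transcript line with a scene description."""
    lines = []
    for i, utt in enumerate(utterances):
        scene = describe_scene(utt.video_path)
        lines.append(f"[{i}] {utt.speaker}: {utt.text} (scene: {scene})")
    return "\n".join(lines)


def classify_turning_point(utterances: List[Utterance],
                           describe_scene: Callable[[str], str],
                           reason: Callable[[str], str]) -> bool:
    """Stage 2: ask the reasoner whether the narrative contains a turning point."""
    narrative = build_narrative(utterances, describe_scene)
    prompt = (
        "Does the following conversation contain a turning point? "
        "Answer yes or no.\n\n" + narrative
    )
    answer = reason(prompt)
    return answer.strip().lower().startswith("yes")


# Stub models (assumptions): replace with real model calls in practice.
def fake_describer(video_path: str) -> str:
    return "Penny is crying" if "cry" in video_path else "people are chatting"


def fake_reasoner(prompt: str) -> str:
    return "yes" if "crying" in prompt else "no"


# Invented placeholder dialogue, not taken from the dataset.
conversation = [
    Utterance("Leonard", "So, how was your day?", "clip_000.mp4"),
    Utterance("Penny", "I just thought of my ex...", "clip_001_cry.mp4"),
]
has_tp = classify_turning_point(conversation, fake_describer, fake_reasoner)
```

Passing the describer and reasoner as callables keeps the two stages independent, so either model could be swapped without touching the other.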

Abstract

Detecting critical moments, such as emotional outbursts or changes in decisions during conversations, is crucial for understanding shifts in human behavior and their consequences. Our work introduces a novel problem setting focusing on these moments as turning points (TPs), accompanied by a meticulously curated, high-consensus, human-annotated multi-modal dataset. We provide precise timestamps, descriptions, and visual-textual evidence highlighting changes in emotions, behaviors, perspectives, and decisions at these turning points. We also propose a framework, TPMaven, utilizing state-of-the-art vision-language models to construct a narrative from the videos and large language models to classify and detect turning points in our multi-modal dataset. Evaluation results show that TPMaven achieves an F1-score of 0.88 in classification and 0.61 in detection, with additional explanations aligning with human expectations.
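For readers unfamiliar with the metric, the F1-score reported above is the harmonic mean of precision and recall. The sketch below computes the standard binary F1 from label lists. It is a generic illustration only: the paper's exact evaluation protocol (for example, how a detected timestamp is matched to an annotated one) is not specified in this summary, and the labels used here are made up.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for labels in {0, 1}, where 1 marks a turning point."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
example_true = [1, 1, 0, 0, 1]
example_pred = [1, 0, 0, 1, 1]
example_f1 = f1_score(example_true, example_pred)  # precision = recall = 2/3
```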

Paper Structure (42 sections, 6 figures, 5 tables)


Figures (6)

  • Figure 1: Consider this example: everyone is chatting casually. A turning point occurs when Penny (a female character) starts crying after mentioning her ex while sharing personal stories with Leonard and Sheldon (two male characters). According to human commonsense, this should be considered a significant change in the conversation because it catches the attention of both the people watching and the speakers involved (Leonard and Sheldon become confused).
  • Figure 2: The circumplex model of emotions (Russell, 1980)
  • Figure 3: Distribution of the 20 most frequently occurring emotions before and after the turning point in our dataset.
  • Figure 4: Tracking results using GPT-3.5
  • Figure 5: Tracking results using GPT-3.5-turbo
  • ...and 1 more figures

Theorems & Definitions (1)

  • Definition 1