Multimodal Conversation Structure Understanding

Kent K. Chang, Mackenzie Hanh Cramer, Anna Ho, Ti Ti Nguyen, Yilin Yuan, David Bamman

TL;DR

This work addresses multimodal conversation structure understanding by combining sociolinguistic theory with a new dataset, TV-MMPC, which annotates TVQA clips for speaker identity, addressees, side-participants, and reply-to relations. It introduces a unified framework with two core tasks, conversational role attribution and conversation disentanglement, and shows that multimodal LLMs, particularly audio-visual models, generally outperform heuristic baselines. Anonymizing character identities degrades performance, which points to memorization and raises reliability concerns. A resource-efficient LoRA fine-tuning approach improves certain social-structure predictions, and a sociolinguistic analysis of 350,842 utterances uncovers gendered patterns in role assignment and audience design, highlighting representation dynamics in media. The dataset and findings provide a foundation for robust conversation-structure understanding and underscore the importance of cross-modal cues in capturing social dynamics in multimodal content.

Abstract

While multimodal large language models (LLMs) excel at dialogue, whether they can adequately parse the structure of conversation -- conversational roles and threading -- remains underexplored. In this work, we introduce a suite of tasks and release TV-MMPC, a new annotated dataset, for multimodal conversation structure understanding. Our evaluation reveals that while all multimodal LLMs outperform our heuristic baseline, even the best-performing model we consider experiences a substantial drop in performance when character identities of the conversation are anonymized. Beyond evaluation, we carry out a sociolinguistic analysis of 350,842 utterances in TVQA. We find that while female characters initiate conversations at rates in proportion to their speaking time, they are 1.2 times more likely than men to be cast as an addressee or side-participant, and the presence of side-participants shifts the conversational register from personal to social.
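
To make the prediction target concrete, the following is a minimal sketch of how one utterance's structure (speaker, addressees, side-participants, reply-to link) might be represented, and how reply-to links induce conversation threads. The class name, field names, the `threads` helper, and the convention that `reply_to=None` starts a new thread are all illustrative assumptions, not the TV-MMPC release format; the participants "A", "B", and "C" are made-up placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class UtteranceAnnotation:
    """Hypothetical record for one utterance (not the official schema)."""
    utterance_id: int
    speaker: str
    addressees: List[str] = field(default_factory=list)
    side_participants: List[str] = field(default_factory=list)
    # Assumption: None means this utterance opens a new thread.
    reply_to: Optional[int] = None


def threads(annotations: List[UtteranceAnnotation]) -> List[List[int]]:
    """Group utterance ids into threads by following reply-to links.

    Assumes utterances are in temporal order and that every reply
    points to an earlier utterance in the same clip.
    """
    root_of: Dict[int, int] = {}
    groups: Dict[int, List[int]] = {}
    for ann in annotations:
        if ann.reply_to is None:
            root = ann.utterance_id          # starts a new thread
        else:
            root = root_of[ann.reply_to]     # inherits the parent's thread
        root_of[ann.utterance_id] = root
        groups.setdefault(root, []).append(ann.utterance_id)
    return list(groups.values())


# Toy clip with two interleaved threads (placeholder participants).
clip = [
    UtteranceAnnotation(0, "A", ["B"], ["C"]),
    UtteranceAnnotation(1, "C", ["A"], []),
    UtteranceAnnotation(2, "B", ["A"], ["C"], reply_to=0),
    UtteranceAnnotation(3, "A", ["C"], ["B"], reply_to=1),
]
print(threads(clip))
```

In this representation, role attribution fills in `speaker`, `addressees`, and `side_participants`, while disentanglement fills in `reply_to`; the two subtasks share one record per utterance.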

Paper Structure

This paper contains 56 sections, 5 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Our proposed structured prediction task for multimodal conversation structure understanding. Grounded in sociolinguistic and conversation analysis (Goffman 1981; Goffman 1983; Ng 1993; Goodwin 1981; Clark 1982), the task requires predicting, for each utterance in the given clip: the speaker, addressee(s), side-participants, and the utterance it replies to. The example, taken from The Big Bang Theory in TVQA (Lei 2018; Lei 2020), illustrates our unified formulation, which treats conversational role attribution and conversation disentanglement as complementary subtasks for modeling the interactional dynamics of dialogue. Further analysis of this example can be found in Appendix §\ref{sec:anno_examples}.
  • Figure 2: Data creation pipeline. Top: four stages from raw TVQA clips to final annotations. Bottom: samples from automated preprocessing (Stage 2) and human annotation (Stage 3), drawn from clip ID s02e09_seg02_clip_04.
  • Figure 3: Signed explained rank variance from Spearman's $\rho$ between clip-level features and F1 scores for individual conversational roles. Bars indicate the direction and magnitude of the correlation; solid bars are significant ($p<0.05$).
  • Figure 4: Female share of starting vs. holding conversational threads, aggregated by show and overall. Left: raw percentage of threads started or held by women. Right: normalized difference ($\Delta$) between female share of metric and speaking time within each clip. Points are bootstrapped means (95% CIs).
  • Figure 5: The relationship between gender and conversational roles. (a) $P(\text{gender} | \text{role})$: the probability that a participant in a given role is female or male; (b) $P(\text{role} | \text{gender})$: the probability distribution of roles for each gender; (c) odds ratios from a multinomial regression, with speaker as the reference.
  • ...and 4 more figures
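
The quantity plotted in Figure 3 can be illustrated with a short sketch. The caption does not give a formula, so this assumes one plausible reading: "signed explained rank variance" is taken to be $\operatorname{sign}(\rho)\,\rho^2$, where $\rho$ is Spearman's rank correlation. The function names and the toy feature and F1 values are hypothetical, and the significance test ($p<0.05$) is omitted.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def signed_explained_rank_variance(x: Sequence[float], y: Sequence[float]) -> float:
    """Assumed definition: sign(rho) * rho**2."""
    rho = spearman_rho(x, y)
    return (1.0 if rho >= 0 else -1.0) * rho ** 2


# Made-up clip-level feature (e.g. number of speakers) and per-clip F1 scores.
num_speakers = [2, 3, 4, 5]
f1_scores = [0.9, 0.8, 0.6, 0.5]
print(signed_explained_rank_variance(num_speakers, f1_scores))
```

Squaring $\rho$ gives the share of rank variance explained, and reattaching the sign keeps the direction of the association visible in a single bar.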