Multimodal Conversation Structure Understanding
Kent K. Chang, Mackenzie Hanh Cramer, Anna Ho, Ti Ti Nguyen, Yilin Yuan, David Bamman
TL;DR
This work addresses multimodal conversation structure understanding by combining sociolinguistic theory with a new dataset, TV-MMPC, which annotates TVQA clips for speaker identity, addressees, side-participants, and reply-to relations. It introduces a unified framework with two core tasks, conversational role attribution and conversation disentanglement, and shows that multimodal LLMs, particularly audio-visual models, generally outperform heuristic baselines. Anonymizing character identities degrades performance, however, which points to memorization and raises reliability concerns. A resource-efficient LoRA fine-tuning approach improves certain social-structure predictions, and a sociolinguistic analysis of 350,842 utterances uncovers gendered patterns in role assignment and audience design, highlighting representation dynamics in media. The dataset and findings provide a foundation for robust conversation structure understanding and underscore the importance of cross-modal cues in capturing social dynamics in multimodal content.
Abstract
While multimodal large language models (LLMs) excel at dialogue, whether they can adequately parse the structure of conversation -- conversational roles and threading -- remains underexplored. In this work, we introduce a suite of tasks and release TV-MMPC, a new annotated dataset, for multimodal conversation structure understanding. Our evaluation reveals that while all multimodal LLMs outperform our heuristic baseline, even the best-performing model we consider experiences a substantial drop in performance when the character identities in the conversation are anonymized. Beyond evaluation, we carry out a sociolinguistic analysis of 350,842 utterances in TVQA. We find that while female characters initiate conversations at rates in proportion to their speaking time, they are 1.2 times more likely than male characters to be cast as an addressee or side-participant, and the presence of side-participants shifts the conversational register from personal to social.
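As a rough illustration of the two structural notions named above, conversational roles and threading, the sketch below stores one record per utterance and groups utterances into threads by following reply-to links. The field names, the `threads` helper, and the toy clip are hypothetical and chosen for illustration only; they are not the TV-MMPC schema or the authors' method.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Utterance:
    """One annotated utterance. Field names are illustrative assumptions."""
    uid: int
    speaker: str
    text: str
    addressees: list[str] = field(default_factory=list)
    side_participants: list[str] = field(default_factory=list)
    reply_to: Optional[int] = None  # uid of the parent utterance; None starts a thread


def threads(utterances: list[Utterance]) -> list[list[int]]:
    """Group utterance ids into threads by following reply-to links to a root.

    Assumes every reply_to points at a uid present in the list and that
    the links contain no cycles.
    """
    parent = {u.uid: u.reply_to for u in utterances}

    def root(uid: int) -> int:
        while parent[uid] is not None:
            uid = parent[uid]
        return uid

    grouped: dict[int, list[int]] = {}
    for u in utterances:
        grouped.setdefault(root(u.uid), []).append(u.uid)
    return list(grouped.values())


# Toy clip with placeholder speakers A, B, C (not taken from the dataset).
clip = [
    Utterance(0, "A", "Did you get the results?", addressees=["B"], side_participants=["C"]),
    Utterance(1, "B", "Not yet.", addressees=["A"], side_participants=["C"], reply_to=0),
    Utterance(2, "C", "Anyone seen my keys?", addressees=["A", "B"]),
    Utterance(3, "A", "Check the counter.", addressees=["C"], side_participants=["B"], reply_to=2),
]
print(threads(clip))  # [[0, 1], [2, 3]]
```

In this toy clip, role attribution corresponds to filling in `speaker`, `addressees`, and `side_participants` for each utterance, while disentanglement corresponds to predicting `reply_to`, from which the two threads are recovered.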
