Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations

Sangmin Lee, Bolin Lai, Fiona Ryan, Bikram Boote, James M. Rehg

TL;DR

This work addresses the interpretation of multi-party social interactions by jointly modeling verbal and non-verbal cues. It introduces three fine-grained tasks (speaking target identification, pronoun coreference resolution, and mentioned player prediction), built on extended social deduction game data whose annotations are reported to be highly reliable. The proposed baseline uses densely aligned language-visual representations: it tracks each player visually, aligns the per-player visual features with the corresponding utterances, and combines visual interaction modeling with conversation context. It outperforms language-only and prior multimodal approaches, with consistent gains across the YouTube and Ego4D domains, and ablations support the contribution of visual cues, conversation context, and permutation learning. The authors release benchmarks and code to support future research on dense multimodal social understanding.

Abstract

Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI.
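
To make the idea of "synchronizing visual features with their corresponding utterances" concrete, the following is a minimal sketch and not the authors' implementation. It assumes, hypothetically, that each utterance carries start and end timestamps and that each player has a track of timestamped visual feature vectors under the same anonymized ID used in the transcript (e.g. `Player1`). The names `Utterance`, `window_mean`, and `align` are illustrative, and mean-pooling over the utterance window is a placeholder for whatever visual encoder the model actually uses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical data layout: one (timestamp_seconds, feature_vector) pair per frame.
Track = List[Tuple[float, List[float]]]


@dataclass
class Utterance:
    speaker: str   # anonymized ID shared with the visual tracks, e.g. "Player1"
    text: str
    start: float   # utterance start time in seconds
    end: float     # utterance end time in seconds


def window_mean(track: Track, start: float, end: float) -> Optional[List[float]]:
    """Average the feature vectors whose timestamps fall inside [start, end]."""
    selected = [feat for t, feat in track if start <= t <= end]
    if not selected:
        return None  # the player was not observed during this utterance
    dim = len(selected[0])
    return [sum(f[i] for f in selected) / len(selected) for i in range(dim)]


def align(utterances: List[Utterance], tracks: Dict[str, Track]) -> List[dict]:
    """Attach one pooled visual feature per player to every utterance."""
    aligned = []
    for utt in utterances:
        per_player = {
            pid: window_mean(track, utt.start, utt.end)
            for pid, track in tracks.items()
        }
        aligned.append({"speaker": utt.speaker, "text": utt.text, "visual": per_player})
    return aligned
```

The point of the sketch is only the pairing: every utterance is associated with visual features for each individual player over the same time span, rather than with a single holistic feature for the whole scene.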

Paper Structure

This paper contains 29 sections, 5 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Concepts of the proposed three social tasks in multi-party environments: speaking target identification, pronoun coreference resolution, and mentioned player prediction.
  • Figure 2: Concept of densely aligned language-visual representations. People are matched in the language and visual domains.
  • Figure 3: Proposed baseline model for understanding multimodal social interactions to tackle our new social tasks via densely aligned language-visual representations. The model consists of four main parts: language-visual alignment (grey), visual interaction modeling (green & purple), conversation context modeling (red), and aligned multimodal fusion for prediction (blue).
  • Figure 4: Qualitative results demonstrating the benefit of visual cues for the three social tasks. The examples show cases where the language model alone fails, but the proposed multimodal baseline, which leverages both language and visual cues, predicts the correct person. Note that player numbers (Player#) are assigned in ascending order from left to right in the visual scenes of this figure.
  • Figure 5: Effects of conversation context length on the performance for speaking target identification.
  • ...and 6 more figures