Omni-MMSI: Toward Identity-attributed Social Interaction Understanding

Xinpeng Li, Bolin Lai, Hardy Chen, Shijian Deng, Cihang Xie, Yuyin Zhou, James Matthew Rehg, Yapeng Tian

Abstract

We introduce Omni-MMSI, a new task that requires comprehensive social interaction understanding from raw audio, vision, and speech input. The task involves perceiving identity-attributed social cues (e.g., who is speaking what) and reasoning about the social interaction (e.g., whom the speaker refers to). This task is essential for developing AI assistants that can perceive and respond to human interactions. Unlike prior studies that operate on oracle-preprocessed social cues, Omni-MMSI reflects realistic scenarios where AI assistants must perceive and reason from raw data. However, existing pipelines and multi-modal LLMs perform poorly on Omni-MMSI because they lack reliable identity attribution capabilities, which leads to inaccurate social interaction understanding. To address this challenge, we propose Omni-MMSI-R, a reference-guided pipeline that produces identity-attributed social cues with tools and conducts chain-of-thought social reasoning. To facilitate this pipeline, we construct participant-level reference pairs and curate reasoning annotations on top of the existing datasets. Experiments demonstrate that Omni-MMSI-R outperforms advanced LLMs and counterparts on Omni-MMSI. Project page: https://sampson-lee.github.io/omni-mmsi-project-page.

Paper Structure

This paper contains 34 sections, 5 equations, 14 figures, and 9 tables.

Figures (14)

  • Figure 1: Overview of the Omni-MMSI task and Omni-MMSI-R pipeline. The Omni-MMSI task explores social interaction understanding in a multi-party social scene using only raw audio and video, unlike prior studies that assume identity-attributed social cues are perfectly provided. To address the attribution challenge, our Omni-MMSI-R is explicitly guided by individual references to generate identity-attributed multi-modal cues and performs CoT reasoning for accurate social interaction understanding.
  • Figure 2: Illustration of the challenge in Omni-MMSI. The quantitative results (left) show that prior pipelines, humans, and advanced Omni-LLMs all suffer substantial accuracy drops when moving from oracle cues to raw audio-video input. Typical attribution failures (right), in which speech and bounding boxes are mismatched to identities, reveal the weak multi-modal identity attribution of advanced Omni-LLMs.
  • Figure 3: Overview of the Omni-MMSI-R pipeline. Given a query audio-video segment with multiple participants, the system first retrieves reference audio-vision pairs that represent each individual. Task-specific tools for transcription, diarization, detection, and ReID generate identity-attributed verbal and non-verbal social cues, specifying who speaks what and where they are. These cues, together with the references and the raw audio-video stream, form the reference-guided input. The Omni-LLM (Qwen2.5 Omni 7B fine-tuned with LoRA) then performs chain-of-thought reasoning over this input to produce an accurate response for social interaction understanding (a minimal sketch of this flow follows the figure list).
  • Figure 4: Illustration of preparation of reference audio-vision pairs for each participant, which serve as anchors for identity attribution.
  • Figure 5: Illustration of the construction of CoT datasets.
  • ...and 9 more figures
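
The caption of Figure 3 describes the pipeline concretely enough to sketch its control flow. The snippet below is a minimal, hypothetical Python sketch of that flow; the ReferencePair structure, the tool callables (transcribe, diarize, detect_and_reid), and omni_llm.generate are assumed interfaces invented for illustration, not the paper's released code.

```python
# Hypothetical sketch of the Omni-MMSI-R flow described in Figure 3.
# All names below are illustrative placeholders, not the authors' implementation.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ReferencePair:
    """Participant-level anchor: a short voice sample plus a face/body crop."""
    participant_id: str
    audio_clip: Any   # enrollment speech sample for this participant
    image_crop: Any   # representative visual appearance


def omni_mmsi_r(
    query_audio: Any,
    query_video: Any,
    question: str,
    references: List[ReferencePair],
    transcribe: Callable[[Any], str],                # speech -> text
    diarize: Callable[..., List[Tuple[str, str]]],   # -> [(participant_id, utterance)]
    detect_and_reid: Callable[..., Dict[str, Any]],  # -> {participant_id: box/track}
    omni_llm: Any,                                   # e.g. Qwen2.5 Omni 7B fine-tuned with LoRA
) -> str:
    # 1) Tool stage: turn raw audio-video into identity-attributed social cues.
    transcript = transcribe(query_audio)
    turns = diarize(query_audio, [r.audio_clip for r in references])
    tracks = detect_and_reid(query_video, [r.image_crop for r in references])

    # 2) Reference-guided input: per-participant references plus cues stating
    #    who speaks what and where they are in the frame.
    cues = [f"{pid} says: {utt} (location: {tracks.get(pid)})" for pid, utt in turns]
    prompt = (
        "Participants: " + ", ".join(r.participant_id for r in references) + "\n"
        "Identity-attributed cues:\n" + "\n".join(cues) + "\n"
        "Full transcript: " + transcript + "\n"
        "Question: " + question + "\n"
        "Reason step by step before giving the final answer."  # CoT instruction
    )

    # 3) The Omni-LLM reasons over the raw streams together with the
    #    reference-guided prompt and returns the final response.
    return omni_llm.generate(audio=query_audio, video=query_video, text=prompt)
```

In the paper, the tool stage corresponds to transcription, diarization, detection, and ReID models, and the Omni-LLM is Qwen2.5 Omni 7B fine-tuned with LoRA; the exact prompt layout and cue format shown above are placeholders.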