SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models

Tianyu Xie, Jinfa Huang, Yuexiao Ma, Rongfang Luo, Yan Yang, Wang Chen, Yuhui Zeng, Ruize Fang, Yixuan Zou, Xiawu Zheng, Jiebo Luo, Rongrong Ji

Abstract

Omni-modal large language models (OLMs) redefine human-machine interaction by natively integrating audio, vision, and text. However, existing OLM benchmarks remain anchored to static, accuracy-centric tasks, leaving a critical gap in assessing social interactivity, the fundamental capacity to navigate dynamic cues in natural dialogues. To this end, we propose SocialOmni, a comprehensive benchmark that operationalizes the evaluation of this conversational interactivity across three core dimensions: (i) speaker separation and identification (who is speaking), (ii) interruption timing control (when to interject), and (iii) natural interruption generation (how to phrase the interruption). SocialOmni features 2,000 perception samples and a quality-controlled diagnostic set of 209 interaction-generation instances with strict temporal and contextual constraints, complemented by controlled audio-visual inconsistency scenarios to test model robustness. Benchmarking 12 leading OLMs uncovers significant variance in their social-interaction capabilities. Furthermore, our analysis reveals a pronounced decoupling between a model's perceptual accuracy and its ability to generate contextually appropriate interruptions, indicating that understanding-centric metrics alone are insufficient to characterize conversational social competence. More encouragingly, the diagnostics from SocialOmni yield actionable signals for bridging the perception-interaction divide in future OLMs.

Paper Structure (35 sections, 7 equations, 10 figures, 7 tables, 2 algorithms)

Figures (10)

  • Figure 1: Overview of SocialOmni. (a) Benchmark data distribution across 15 subcategories and four domains, with consistent/inconsistent stratification and perception/generation task splits. (b) Overview of the proposed evaluation tasks and metrics. (c) Performance comparison of 12 OLMs on both Task I and Task II.
  • Figure 2: Illustration of the SocialOmni evaluation pipeline. Given a multi-modal conversation stream (Zone 1), SocialOmni constructs both audio-visual inconsistent and consistent scenarios (Zone 2), then evaluates models on speaker perception (Task I) and turn-entry generation (Task II) with LLM-based judging (Zone 3).
  • Figure 3: Cross-axis capability profiles. Each polygon shows one model over normalized who-when-how dimensions. No single model dominates all axes, revealing distinct strengths and weaknesses.
  • Figure 4: Timing-phase decomposition for turn entry. Early/On-time/Late rates expose whether a model tends to interrupt prematurely or miss the optimal conversational window during dialogue.
  • Figure 5: Precision-recall operating points for when decisions. Iso-F1 guides highlight the fundamental trade-off between cautious and trigger-happy turn-entry strategies in dialogue systems.
  • ...and 5 more figures
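
The captions for Figures 4 and 5 refer to Early/On-time/Late rates and to iso-F1 guides. The sketch below is one plausible way to compute such quantities, not the paper's actual procedure. The function names, the symmetric tolerance window `tolerance_s`, and its default of 0.5 seconds are all hypothetical, since this listing does not state how SocialOmni defines an on-time turn entry.

```python
from collections import Counter


def timing_phase(predicted_s, reference_s, tolerance_s=0.5):
    """Label one turn entry as early, on_time, or late.

    tolerance_s is an assumed symmetric window around the reference
    time; the benchmark's real criterion may differ.
    """
    delta = predicted_s - reference_s
    if delta < -tolerance_s:
        return "early"
    if delta > tolerance_s:
        return "late"
    return "on_time"


def phase_rates(pairs, tolerance_s=0.5):
    """Return the fraction of early / on_time / late entries.

    pairs is an iterable of (predicted_s, reference_s) tuples.
    """
    counts = Counter(timing_phase(p, r, tolerance_s) for p, r in pairs)
    total = sum(counts.values())
    if total == 0:
        return {"early": 0.0, "on_time": 0.0, "late": 0.0}
    return {k: counts[k] / total for k in ("early", "on_time", "late")}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (the standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up (predicted, reference) times in seconds, for illustration only.
    demo = [(1.0, 2.0), (2.1, 2.0), (3.0, 2.0), (2.0, 2.0)]
    print(phase_rates(demo))
    print(f1_score(0.5, 0.5))
```

An iso-F1 guide is the set of precision-recall pairs for which `f1_score` returns the same value, so a cautious model (high precision, low recall) and a trigger-happy one (low precision, high recall) can lie on the same curve.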