See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee

TL;DR

This work introduces AV-SpeakerBench, a speaker-centric audiovisual reasoning benchmark designed to tightly bind who speaks, what is said, and when it occurs in real-world videos. It uses a fusion-driven, four-choice multiple-choice question (MCQ) design with expert-curated, temporally precise annotations across 12 task types, emphasizing cross-modal grounding over static cues. Empirical results show Gemini 2.5 Pro achieving the best overall performance among evaluated models yet remaining substantially below human accuracy, with audiovisual fusion identified as the primary driver of performance gaps. The benchmark establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in multimodal systems and highlights clear directions for improving temporal alignment and audiovisual integration.

Abstract

Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.
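
As a rough illustration of how a four-choice multiple-choice benchmark of this kind might be scored, the sketch below computes overall and per-task accuracy from predicted option letters. It is not the authors' evaluation code: the `Item` fields, the `score` function, the option letters A to D, and the task names are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed option labels for a four-choice question (hypothetical convention).
CHOICES = ("A", "B", "C", "D")


@dataclass(frozen=True)
class Item:
    """One benchmark question (hypothetical schema, not the released format)."""
    question_id: str
    task_type: str
    answer: str  # ground-truth option letter


def score(items, predictions):
    """Return (overall_accuracy, per_task_accuracy).

    `predictions` maps question_id -> predicted option letter. Missing or
    malformed predictions count as incorrect.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item.task_type] += 1
        pred = predictions.get(item.question_id, "").strip().upper()
        if pred in CHOICES and pred == item.answer:
            correct[item.task_type] += 1
    per_task = {task: correct[task] / total[task] for task in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return overall, per_task


if __name__ == "__main__":
    # Toy data with placeholder task names; not drawn from the benchmark.
    items = [
        Item("q1", "task_a", "A"),
        Item("q2", "task_a", "B"),
        Item("q3", "task_b", "C"),
    ]
    predictions = {"q1": "A", "q2": "D", "q3": "c"}
    overall, per_task = score(items, predictions)
    print(f"overall={overall:.3f}", per_task)
```

With four options per question, uniform random guessing gives an expected accuracy of 25%, which is a useful floor when reading per-task numbers.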

Paper Structure

This paper contains 47 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Motivation of AV-SpeakerBench. Existing video benchmarks often contain visually solvable questions, such as counting visible people, where state-of-the-art multimodal models can answer correctly even when the audio stream is muted (left; examples from Video-MME [fu2025video]). In contrast, questions in AV-SpeakerBench (right) are explicitly designed to require audiovisual fusion: the correct answer depends on who speaks, when they speak, and how speech events unfold over time.
  • Figure 2: Top: Examples of audiovisual reasoning questions in AV-SpeakerBench. Each question illustrates a distinct way in which audiovisual dependency is enforced (through spoken-phrase grounding, visual event conditioning, cross-modal temporal localization, or multi-speaker coordination), ensuring that the correct answer cannot be inferred from a single modality. Bottom: Dataset distribution. We present the distribution of videos by duration, task category, and visual complexity (measured by the number of unique visible people). Together, these statistics highlight the diversity of conversational scenes and reasoning types represented in AV-SpeakerBench.
  • Figure 3: Multimodal ablation and error analysis.
  • Figure 4: Qualitative examples of Gemini 2.5 Pro reasoning traces on AV-SpeakerBench. Green and red highlights indicate the model’s correct and incorrect reasoning, respectively. (a) Vision-only example answered correctly: the model identifies the correct speaker by tracking the duration and consistency of mouth movement and conversational gestures, which serve as natural visual cues for inferring who is speaking. (b) Vision-only example answered incorrectly: the model incorrectly associates slower gestures with slower speech, leading to a wrong prediction. (c) The same example as (b) but with audio input: the model correctly identifies the faster speaker once speech-rate evidence becomes available, confirming that the question requires true audiovisual fusion. (d) Vision + audio example answered incorrectly: the model predicts that only one woman speaks, whereas both women say "Okay" after the event. Eventually, all three speakers talk after the event, showing residual difficulty in temporal alignment and speaker disambiguation.
  • Figure 5: Annotation interface for rate-comparison tasks. The interface presents annotators with the video clip, metadata (video ID, category, task type), the question, all answer choices, and the selected response. Annotators also specify the temporal window used for judgment and provide a brief justification. The examples shown correspond to (left) lowest rate of speech, (middle) highest rate of speech, and (right) lowest rate of speech for a different time span within the same video. These examples illustrate how annotators validate temporal reasoning by explicitly grounding answers in the video timeline.
  • ...and 8 more figures