See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Le Thien Phuc Nguyen; Zhuoran Yu; Samuel Low Yu Hang; Subin An; Jeongik Lee; Yohan Ban; SeungEun Chung; Thanh-Huy Nguyen; JuWan Maeng; Soochahn Lee; Yong Jae Lee

TL;DR

This work introduces AV-SpeakerBench, a speaker-centric audiovisual reasoning benchmark designed to tightly bind who speaks, what is said, and when it occurs in real-world videos. It features a fusion-driven, four-choice MCQ design with expert-curated, temporally precise annotations across 12 task types, emphasizing cross-modal grounding over static cues. Empirical results show Gemini 2.5 Pro achieving the best overall performance among evaluated models yet remaining substantially below human accuracy, with audiovisual fusion identified as the primary driver of performance gaps. The benchmark establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in multimodal systems and highlights clear directions for improving temporal alignment and audiovisual integration.

Abstract

Multimodal large language models (MLLMs) are expected to jointly interpret vision, audio, and language, yet existing video benchmarks rarely assess fine-grained reasoning about human speech. Many tasks remain visually solvable or only coarsely evaluate speech, offering limited insight into whether models can align who speaks, what is said, and when it occurs. We introduce AV-SpeakerBench, a curated benchmark of 3,212 multiple-choice questions focused on speaker-centric audiovisual reasoning in real-world videos. It features: (1) a speaker-centered formulation that treats speakers, not scenes, as the core reasoning unit; (2) fusion-grounded question design embedding audiovisual dependencies into question semantics; and (3) expert-curated annotations ensuring temporal precision and cross-modal validity. Comprehensive evaluations show that the Gemini family consistently outperforms open-source systems, with Gemini 2.5 Pro achieving the best results. Among open models, Qwen3-Omni-30B approaches Gemini 2.0 Flash but remains far behind Gemini 2.5 Pro, primarily due to weaker audiovisual fusion rather than visual perception. We believe AV-SpeakerBench establishes a rigorous foundation for advancing fine-grained audiovisual reasoning in future multimodal systems.
