EgoSound: Benchmarking Sound Understanding in Egocentric Videos

Bingwen Zhu, Yuqian Fu, Qiaole Dong, Guolei Sun, Tianwen Qian, Yuzheng Wu, Danda Pani Paudel, Xiangyang Xue, Yanwei Fu

TL;DR

EgoSound is the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. It establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.

Abstract

Multimodal Large Language Models (MLLMs) have recently achieved remarkable progress in vision-language understanding. Yet, human perception is inherently multisensory, integrating sight, sound, and motion to reason about the world. Among these modalities, sound provides indispensable cues about spatial layout, off-screen events, and causal interactions, particularly in egocentric settings where auditory and visual signals are tightly coupled. To this end, we introduce EgoSound, the first benchmark designed to systematically evaluate egocentric sound understanding in MLLMs. EgoSound unifies data from Ego4D and EgoBlind, encompassing both sighted and sound-dependent experiences. It defines a seven-task taxonomy spanning intrinsic sound perception, spatial localization, causal inference, and cross-modal reasoning. Constructed through a multi-stage auto-generative pipeline, EgoSound contains 7315 validated QA pairs across 900 videos. Comprehensive experiments on nine state-of-the-art MLLMs reveal that current models exhibit emerging auditory reasoning abilities but remain limited in fine-grained spatial and causal understanding. EgoSound establishes a challenging foundation for advancing multisensory egocentric intelligence, bridging the gap between seeing and truly hearing the world.

Paper Structure (37 sections, 17 figures, 3 tables)

Figures (17)

  • Figure 1: EgoSound vs. existing egocentric Video Question Answering (VideoQA). Prior datasets (left) [mangalam2023egoschema, plizzari2025omnia] focus solely on vision-centric question answering with no awareness of audio, whereas EgoSound constructs a more complex and comprehensive audio-visual QA dataset tailored for sound understanding. It is built from two dataset sources [xiao2025egoblind, grauman2022ego4d], includes 900 videos and 7315 high-quality QA pairs, and spans seven task categories, making it a benchmark that can both listen and see.
  • Figure 2: Overview of the EgoSound data curation pipeline. We first identify human interaction events, then generate interaction-grounded and sound-centric audio-visual captions, and finally build visually verified OpenQA pairs corresponding to the seven core tasks.
  • Figure 3: Overview of the EgoSound task taxonomy and statistics. (Top) Statistics on video length, question type, and the number of questions for each task category. (Bottom) A selection of representative examples for each core task of EgoSound.
  • Figure 4: Accuracy comparison on EgoSound for Qwen3-Omni-Thinking [xu2025qwen3] with audio-visual vs. audio-only input. Results for the three sound-dependent tasks are shown on the left (orange), while the right (blue) shows results for the other four tasks, which depend on both visual and audio input.
  • Figure 5: Comparison of Cross-Modal Reasoning with and without visual input. The video shows an egocentric airplane scene in which a flight attendant handles a blanket for the passenger. The question asks what happens after the rustling sound produced during this action. The left side presents model outputs with audio-visual frames; the right side presents outputs with audio alone.
  • ...and 12 more figures