EgoAVU: Egocentric Audio-Visual Understanding
Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai
TL;DR
EgoAVU is a scalable data engine for egocentric audio–visual understanding that automatically generates multimodal narrations and QA pairs, yielding EgoAVU-Instruct (≈3 million samples from 9,000 videos) and EgoAVU-Bench (3K QA pairs from 900 videos). To mitigate modality bias, it uses a modular pipeline that combines unimodal narration, a Multimodal Context Graph (MCG), and open-source LLMs to produce coherent audio–visual narrations and diverse QA tasks. Fine-tuning MLLMs on EgoAVU-Instruct improves performance on EgoAVU-Bench by up to 113% (relative) and transfers to other egocentric benchmarks such as EgoTempo and EgoIllusion (up to 28% relative gain), indicating better audio grounding and cross-modal reasoning. The work positions EgoAVU as a versatile data engine for scalable multimodal dataset creation and evaluation, with planned code and data releases to support broader adoption.
Abstract
Understanding egocentric videos is vital for embodied intelligence. Recent multimodal large language models (MLLMs) can accept both visual and audio inputs. However, because text labels with coherent joint-modality information are difficult to obtain, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine that automatically generates egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they are heavily biased toward visual signals, often neglecting audio cues or failing to associate audio with its visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, yielding up to 113% relative performance improvement on EgoAVU-Bench. These benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, with up to 28% relative performance gain. Code will be released to the community.
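
The gains quoted above are percentages relative to a baseline score. As a minimal sketch of that arithmetic, assuming each model is summarized by a single scalar benchmark score (the function name and the example scores are hypothetical illustrations, not values from the paper):

```python
def relative_improvement(baseline: float, finetuned: float) -> float:
    """Return the gain of `finetuned` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (finetuned - baseline) / baseline * 100.0


# Hypothetical scores for illustration only (not taken from the paper).
baseline_score = 40.0
finetuned_score = 60.0
print(f"{relative_improvement(baseline_score, finetuned_score):.0f}% relative gain")
```

Under this definition, a relative gain above 100% means the finetuned score is more than double the baseline score.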
