EgoAVU: Egocentric Audio-Visual Understanding

Ashish Seth, Xinhao Mei, Changsheng Zhao, Varun Nagaraja, Ernie Chang, Gregory P. Meyer, Gael Le Lan, Yunyang Xiong, Vikas Chandra, Yangyang Shi, Dinesh Manocha, Zhipeng Cai

TL;DR

EgoAVU introduces a scalable data engine to enable egocentric audio–visual understanding by automatically generating multimodal narrations and QA pairs, yielding EgoAVU-Instruct (≈3 million samples from 9,000 videos) and EgoAVU-Bench (3K QA pairs from 900 videos). It mitigates modality bias by using a modular pipeline with unimodal narration, a Multimodal Context Graph (MCG), and open-source LLMs to produce coherent audio–visual narrations and diverse QA tasks. Fine-tuning MLLMs on EgoAVU-Instruct significantly boosts performance on EgoAVU-Bench (up to 113% relative improvement) and transfers to other egocentric benchmarks (up to 28%), highlighting improved audio grounding and cross-modal reasoning. The work establishes EgoAVU as a versatile data engine for scalable multimodal dataset creation and evaluation, with code and data releases to enable broader adoption and future improvements.

Abstract

Understanding egocentric videos plays a vital role in embodied intelligence. Recent multi-modal large language models (MLLMs) can accept both visual and audio inputs. However, due to the challenge of obtaining text labels with coherent joint-modality information, whether MLLMs can jointly understand both modalities in egocentric videos remains under-explored. To address this problem, we introduce EgoAVU, a scalable data engine to automatically generate egocentric audio-visual narrations, questions, and answers. EgoAVU enriches human narrations with multimodal context and generates audio-visual narrations through cross-modal correlation modeling. Token-based video filtering and modular, graph-based curation ensure both data diversity and quality. Leveraging EgoAVU, we construct EgoAVU-Instruct, a large-scale training dataset of 3M samples, and EgoAVU-Bench, a manually verified evaluation split covering diverse tasks. EgoAVU-Bench clearly reveals the limitations of existing MLLMs: they are heavily biased toward visual signals, often neglecting audio cues or failing to associate audio with its visual source. Finetuning MLLMs on EgoAVU-Instruct effectively addresses this issue, enabling up to 113% relative performance improvement on EgoAVU-Bench. Such benefits also transfer to other benchmarks such as EgoTempo and EgoIllusion, achieving up to 28% relative performance gain. Code will be released to the community.

Paper Structure (21 sections, 2 equations, 21 figures, 7 tables)


Figures (21)

  • Figure 1: Overview of EgoAVU. We introduce EgoAVU, a scalable and automated data engine to enable egocentric audio–visual understanding. EgoAVU enriches existing egocentric narrations by integrating human actions with environmental context, explicitly linking visible objects and the sounds produced during interactions or surroundings. Leveraging this pipeline, we construct EgoAVU-Instruct (3M QAs) and EgoAVU-Bench (3K verified QAs), enabling systematic training and evaluation of MLLMs. Models finetuned with EgoAVU-Instruct exhibit high audio-visual grounding in egocentric settings.
  • Figure 2: EgoAVU pipeline. EgoAVU consists of four key components. (1) For each egocentric video clip, EgoAVU enhances the raw narration with detailed multisensory context using open-source MLLMs (Bai et al., 2025; Xu et al., 2025). (2) These enriched narrations are then used to select clips that exhibit diverse audio–visual dynamics. (3) Next, EgoAVU constructs a Multimodal Context Graph (MCG), automatically generated via open-source LLMs (Llama 3 model card), to capture complex cross-modal relations. The MCG is parsed alongside the enhanced narrations to produce coherent audio–visual narrations. (4) The generated audio–visual narrations are leveraged to generate high-quality audio–visual QA pairs, forming both the instruction-tuning dataset EgoAVU-Instruct and the evaluation benchmark EgoAVU-Bench.
  • Figure 3: Video duration distribution. Our videos include both short clips within 1 min and long videos of 6 min.
  • Figure 4: Distribution of 20 most common visual scenarios in EgoAVU-Instruct and EgoAVU-Bench.
  • Figure 5: Distribution of proposed tasks across EgoAVU-Instruct and EgoAVU-Bench.
  • ...and 16 more figures
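
The Figure 2 caption describes a Multimodal Context Graph that links human actions, visible objects, and the sounds they produce, and that is then parsed into audio–visual narrations. As a rough illustration of that idea only, the sketch below stores such links as typed edges and turns each action → object → sound path into a sentence. Everything here is an assumption made for illustration: the `Node` and `ContextGraph` classes, the node kinds, the relation names `involves` and `produces`, and the template-based `narrate` function are hypothetical. In the paper the graph and narrations are generated with open-source LLMs, not with fixed templates.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # hypothetical node types: "action", "object", "sound"


@dataclass
class ContextGraph:
    # Each edge is (source, relation, target); relation names are illustrative.
    edges: list = field(default_factory=list)

    def link(self, src: Node, relation: str, dst: Node) -> None:
        self.edges.append((src, relation, dst))

    def targets(self, src: Node, relation: str) -> list:
        return [d for s, r, d in self.edges if s == src and r == relation]


def narrate(graph: ContextGraph) -> list:
    """Turn each action -> object -> sound path into one sentence.

    A template stand-in for the LLM-based narration step in the paper.
    """
    actions = []
    for src, _, _ in graph.edges:
        if src.kind == "action" and src not in actions:
            actions.append(src)

    sentences = []
    for action in actions:
        for obj in graph.targets(action, "involves"):
            sounds = graph.targets(obj, "produces")
            sentence = f"The camera wearer {action.name} with the {obj.name}"
            if sounds:
                heard = " and ".join(s.name for s in sounds)
                sentence += f", which makes a {heard}"
            sentences.append(sentence + ".")
    return sentences


# Toy example: one action, one object, one sound.
chop = Node("chops carrots", "action")
knife = Node("knife", "object")
tap = Node("rhythmic tapping sound", "sound")

g = ContextGraph()
g.link(chop, "involves", knife)
g.link(knife, "produces", tap)

for line in narrate(g):
    print(line)
```

Keeping the sound attached to the object that produces it, not to the clip as a whole, is what lets a narration (and any question derived from it) tie an audio cue to its visual source.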