Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

TL;DR

This work introduces a self-supervised approach for learning spatial audio-visual representations from egocentric videos by inpainting masked binaural audio conditioned on the video and the unmasked audio. The model uses a masked autoencoder framework with a novel audio-masking strategy that encourages it to learn robust audio-visual spatial correspondences. A multi-stream transformer fuses video and audio into a shared AV representation, which is then used for binaural audio inpainting. The learned spatial AV features transfer to two downstream social tasks, active speaker detection and spatial audio denoising, where they improve over multiple state-of-the-art baselines on the EgoCom and EasyCom datasets. The approach emphasizes human-centric spatial grounding in egocentric settings and generalizes across both tasks; qualitative analyses show that the model attends to speaker faces and to environment features that shape spatial sound. Overall, the method provides a generic, data-efficient way to integrate spatial audio cues into egocentric vision systems, supporting AR and accessibility applications.

Abstract

We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. Through extensive experiments, we show that our features are generic enough to improve over multiple state-of-the-art baselines on both tasks on two challenging egocentric video datasets that offer binaural audio, EgoCom and EasyCom. Project: http://vision.cs.utexas.edu/projects/ego_av_corr.

Paper Structure (60 sections, 2 equations, 9 figures, 14 tables)

Figures (9)

  • Figure 1: Given an egocentric video and binaural audio, we aim to learn spatial correspondences between vision and audio by solving the pretext task of inpainting segments of the binaural audio. The features benefit downstream social tasks where spatial localization is important: active speaker detection and audio denoising.
  • Figure 2: Our model learns the spatial correspondence between vision and binaural audio by inpainting masked tokens of the audio channels through the use of an audio-visual encoder-decoder model. We combine random token masking (which requires solving a more local binauralization task) with complete audio channel masking (which requires more global cues to synthesize unseen binaural segments). For downstream evaluation, we fuse the features from the audio-visual encoder with the backbones for downstream tasks, and finetune them.
  • Figure 3: Masked targets and predictions shown alongside the unmasked inputs for (a) token masking and (b) channel masking. Our predictions accurately capture the global patterns in the target spectrograms, which depend on the scene's spatial properties.
  • Figure 4: Heat maps showing the image areas our model's AV encoder attends to, placed alongside the images. Brighter yellow means higher attention score. Our model attends to image regions (e.g. faces of speakers, sound-reflecting flat regions like floor and table, etc.) that strongly determine the spatial properties of the audio, including direct sources of sound (marked in red).
  • Figure 5: Success cases for (a) active speaker detection (ASD) and (b) spatial audio denoising.
  • ...and 4 more figures
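
The masking scheme described in the Figure 2 caption (random token masking combined with complete audio channel masking) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `make_audio_mask`, the token grid shape, the `mask_ratio` value, and the rule of picking one strategy per sample with probability `channel_mask_prob` are all assumptions made for illustration.

```python
import numpy as np


def make_audio_mask(num_channels, tokens_per_channel, mask_ratio=0.5,
                    channel_mask_prob=0.5, rng=None):
    """Build a boolean mask over binaural audio tokens (True = masked).

    Hypothetical helper: the ratios and the per-sample choice between the
    two strategies are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((num_channels, tokens_per_channel), dtype=bool)

    if rng.random() < channel_mask_prob:
        # Channel masking: hide every token of one audio channel, so the
        # missing channel must be synthesized from the other channel
        # and the video (a more global task).
        channel = rng.integers(num_channels)
        mask[channel, :] = True
        strategy = "channel"
    else:
        # Token masking: hide a random subset of tokens across all
        # channels (a more local inpainting task).
        total = num_channels * tokens_per_channel
        num_masked = int(round(mask_ratio * total))
        idx = rng.choice(total, size=num_masked, replace=False)
        mask.reshape(-1)[idx] = True  # reshape returns a view here
        strategy = "token"

    return mask, strategy


if __name__ == "__main__":
    mask, strategy = make_audio_mask(num_channels=2, tokens_per_channel=16)
    print(strategy)
    print(mask.astype(int))
```

In a masked-autoencoder setup, the encoder would see only the tokens where the mask is `False` (plus the video tokens), and the decoder would be trained to reconstruct the tokens where it is `True`.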