
Spherical World-Locking for Audio-Visual Localization in Egocentric Videos

Heeseung Yun, Ruohan Gao, Ishwarya Ananthabhotla, Anurag Kumar, Jacob Donley, Chao Li, Gunhee Kim, Vamsi Krishna Ithapu, Calvin Murdock

TL;DR

This work tackles multisensory egocentric perception under self-motion by introducing Spherical World-Locking, which places inputs on a world-locked sphere tied to head orientation. It then presents MuST, a Multisensory Spherical World-Locked Transformer that uses implicit SWL with rotation-based spatial cues and modality-specific attention to enable cross-modal collaboration without costly image-to-world projections. Across audio-visual speaker localization, auditory spherical localization, and egocentric behavior anticipation, MuST achieves significant gains over baselines and demonstrates strong generalization, supported by ablations validating the value of pose-informed embeddings and sphere-based processing. The framework promises practical impact for robust, real-time multisensory understanding in egocentric settings and invites future extensions to additional modalities and larger-scale datasets.

Abstract

Egocentric videos provide comprehensive contexts for user and scene understanding, spanning multisensory perception to behavioral interaction. We propose Spherical World-Locking (SWL) as a general framework for egocentric scene representation, which implicitly transforms multisensory streams with respect to measurements of head orientation. Compared to conventional head-locked egocentric representations with a 2D planar field-of-view, SWL effectively offsets challenges posed by self-motion, allowing for improved spatial synchronization between input modalities. Using a set of multisensory embeddings on a world-locked sphere, we design a unified encoder-decoder transformer architecture that preserves the spherical structure of the scene representation, without requiring expensive projections between image and world coordinate systems. We evaluate the effectiveness of the proposed framework on multiple benchmark tasks for egocentric video understanding, including audio-visual active speaker localization, auditory spherical source localization, and behavior anticipation in everyday activities.
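To make the world-locking idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes that head orientation is available as a 3x3 rotation matrix mapping head-frame directions to world-frame directions, and it restricts head motion to yaw for brevity. The helper names `yaw_rotation` and `world_lock` are hypothetical. The sketch shows that a static source appears at different head-locked directions as the head turns, while its world-locked direction stays fixed.

```python
import numpy as np


def yaw_rotation(angle_rad: float) -> np.ndarray:
    """Rotation about the vertical (z) axis.

    Assumed convention: the matrix maps head-frame vectors to world-frame vectors.
    """
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])


def world_lock(direction_head: np.ndarray, head_to_world: np.ndarray) -> np.ndarray:
    """Express a head-frame unit direction in the world frame."""
    return head_to_world @ direction_head


# A static sound source or speaker, given as a unit direction in the world frame.
source_world = np.array([1.0, 0.0, 0.0])

# Three hypothetical head orientations (yaw only).
head_yaws = np.deg2rad([0.0, 30.0, -45.0])

head_locked = []
world_locked = []
for yaw in head_yaws:
    head_to_world = yaw_rotation(yaw)
    # What a head-mounted sensor observes: the source direction in the head frame.
    direction_head = head_to_world.T @ source_world
    head_locked.append(direction_head)
    # Compensating with the measured head orientation recovers the world direction.
    world_locked.append(world_lock(direction_head, head_to_world))

head_locked = np.stack(head_locked)
world_locked = np.stack(world_locked)

print("head-locked directions:\n", np.round(head_locked, 3))
print("world-locked directions:\n", np.round(world_locked, 3))
```

The head-locked rows differ across the three orientations, whereas every world-locked row equals `source_world`. This reduced variability is what the abstract attributes to SWL.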

Paper Structure

This paper contains 13 sections, 6 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: The key idea of our framework. (a) In conventional Head-Locked (HL) frameworks, multisensory observations captured from head-mounted devices are used as-is, where self-motion introduces variability in otherwise static scenes. (b) Our Spherical World-Locking (SWL) framework compensates for self-motion with negligible overhead, leading to lower variability and better learnable scene representation.
  • Figure 2: Three multisensory localization tasks in egocentric videos that we tackle in this work: (a) audio-visual active speaker localization, (b) auditory spherical source localization, and (c) egocentric behavior anticipation.
  • Figure 3: Comparison of explicit and implicit spherical world-locking. While explicit SWL maps the original inputs to the spherical reference frame, implicit SWL retains the original inputs to process position ($\{p_i\}$) and semantic information ($\{x_i\}$) separately.
  • Figure 4: Our MuST model architecture. M- indicates modality-wise operations.
  • Figure 5: Qualitative examples of egocentric active speaker localization on EasyCom (Donley et al., 2021). The red and blue boxes indicate active and non-active speakers, respectively, and the red heatmap indicates the model prediction. MuST makes correct predictions for scenes with gravity misalignment (col. 1), motion blur (cols. 2, 4), and multiple speakers (cols. 3, 5).
  • ...and 3 more figures
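
The distinction drawn in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the MuST implementation: each token carries a semantic feature vector x_i and a unit direction p_i on the sphere, head orientation is a 3x3 head-to-world rotation matrix, and the function names `implicit_world_lock` and `to_azimuth_elevation` are hypothetical. In this implicit variant only the positions are rotated; the features are passed through untouched, so no resampling of the original inputs is needed.

```python
import numpy as np


def implicit_world_lock(features: np.ndarray,
                        positions_head: np.ndarray,
                        head_to_world: np.ndarray):
    """Rotate token positions into the world frame; leave features unchanged.

    features:       (N, D) semantic embeddings x_i, returned as-is.
    positions_head: (N, 3) unit directions p_i in the head frame.
    head_to_world:  (3, 3) rotation matrix (assumed head-to-world convention).
    """
    # Row-vector form of p_world = R @ p_head for every token.
    positions_world = positions_head @ head_to_world.T
    return features, positions_world


def to_azimuth_elevation(positions: np.ndarray) -> np.ndarray:
    """Convert (N, 3) unit directions to (N, 2) azimuth/elevation in radians."""
    azimuth = np.arctan2(positions[:, 1], positions[:, 0])
    elevation = np.arcsin(np.clip(positions[:, 2], -1.0, 1.0))
    return np.stack([azimuth, elevation], axis=1)


# Three hypothetical tokens with 4-dimensional features.
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 4))
positions_head = np.array([[1.0, 0.0, 0.0],   # straight ahead
                           [0.0, 1.0, 0.0],   # to the left
                           [0.0, 0.0, 1.0]])  # straight up

# Head turned 90 degrees about the vertical axis (exact matrix, no rounding).
head_to_world = np.array([[0.0, -1.0, 0.0],
                          [1.0, 0.0, 0.0],
                          [0.0, 0.0, 1.0]])

out_features, positions_world = implicit_world_lock(
    features, positions_head, head_to_world)
angles_world = to_azimuth_elevation(positions_world)
print(np.round(np.rad2deg(angles_world), 1))
```

An explicit variant would instead re-project the raw image or audio onto the sphere before encoding, which is the costlier step that the implicit formulation avoids.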