Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation

Bolin Lai, Fiona Ryan, Wenqi Jia, Miao Liu, James M. Rehg

TL;DR

A Contrastive Spatial-Temporal Separable (CSTS) fusion approach that adopts two modules to separately capture audio-visual correlations in spatial and temporal dimensions, and applies a contrastive loss on the re-weighted audio-visual features from fusion modules for representation learning.

Abstract

Egocentric gaze anticipation serves as a key building block for the emerging capability of Augmented Reality. Notably, gaze behavior is driven by both visual cues and audio signals during daily activities. Motivated by this observation, we introduce the first model that leverages both the video and audio modalities for egocentric gaze anticipation. Specifically, we propose a Contrastive Spatial-Temporal Separable (CSTS) fusion approach that adopts two modules to separately capture audio-visual correlations in spatial and temporal dimensions, and applies a contrastive loss on the re-weighted audio-visual features from the fusion modules for representation learning. We conduct extensive ablation studies and thorough analysis on two egocentric video datasets, Ego4D and Aria, to validate our model design. We demonstrate that audio improves performance by +2.5% and +2.4% on the two datasets. Our model also outperforms the prior state-of-the-art methods by at least +1.9% and +1.6%. Moreover, we provide visualizations of the gaze anticipation results and offer additional insights into audio-visual representation learning. The code and data split are available on our website (https://bolinlai.github.io/CSTS-EgoGazeAnticipation/).
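
To make the idea of a contrastive loss between audio and visual features concrete, here is a minimal sketch. It is an illustration only, not the paper's formulation: this page does not state the exact loss, so the sketch assumes a generic symmetric InfoNCE objective over a batch of paired embeddings. The function names, the `temperature` value, and the use of plain Python lists are all hypothetical choices made for readability.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(video_emb, audio_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumption, not the paper's exact loss).

    video_emb[i] and audio_emb[i] form a positive pair; every other
    combination in the batch is treated as a negative.
    """
    v = [l2_normalize(x) for x in video_emb]
    a = [l2_normalize(x) for x in audio_emb]
    # Cosine similarity between every video/audio pair, scaled by temperature.
    logits = [
        [sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]  # audio-to-video direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

# Toy batch: two video embeddings with matched and mismatched audio embeddings.
video = [[1.0, 0.0], [0.0, 1.0]]
audio_matched = [[2.0, 0.0], [0.0, 3.0]]
audio_swapped = [[0.0, 3.0], [2.0, 0.0]]

loss_matched = symmetric_info_nce(video, audio_matched)
loss_swapped = symmetric_info_nce(video, audio_swapped)
```

In this toy batch the loss is close to zero when each audio embedding points in the same direction as its paired video embedding, and large when the pairs are swapped, which is the behavior a contrastive objective is meant to reward.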

Paper Structure (26 sections, 9 equations, 7 figures, 9 tables)

Figures (7)

  • Figure 1: The problem setting of egocentric gaze anticipation. $\tau_o$ denotes the observation time, and $\tau_a$ denotes the anticipation time. Given the video frames and audio signals of the Input Video Sequence, the model seeks to predict the gaze fixation distribution for the time steps in the Gaze Anticipation Sequence. Green dots indicate the gaze targets in future frames and the heatmap shows the gaze anticipation result from our model.
  • Figure 2: Overview of the proposed model. The video embeddings $\phi(x)$ and audio embeddings $\psi(a)$ are obtained by two transformer-based encoders. We then model the correlations of visual and audio embeddings using two separate branches: (1) spatial fusion, which learns the spatial co-occurrence of audio signals and visual objects in each frame, and (2) temporal fusion, which captures the temporal correlations and possible gaze movement. A contrastive loss is adopted to facilitate audio-visual representation learning. We input the fused embeddings into a decoder to obtain the final gaze anticipation results.
  • Figure 3: The performance of gaze anticipation in each frame. Our model (CSTS) consistently outperforms all prior methods by a notable margin.
  • Figure 4: Egocentric gaze anticipation results from our model and other baselines. We show the results of four future time steps uniformly sampled from the anticipation segments. Green dots indicate the ground truth gaze location.
  • Figure 5: Visualization of the spatial correlation weights. All video frames are sorted in chronological order, indexed by the numbers in the top-right corner.
  • ...and 2 more figures