In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation

Bolin Lai, Miao Liu, Fiona Ryan, James M. Rehg

TL;DR

This paper presents the first transformer-based model for egocentric gaze estimation. It introduces a novel Global-Local Correlation (GLC) module that explicitly models the correlation between the global token and each local token, and it exceeds the previous state-of-the-art model by a large margin.

Abstract

In this paper, we present the first transformer-based model to address the challenging problem of egocentric gaze estimation. We observe that the connection between the global scene context and local visual information is vital for localizing the gaze fixation from egocentric video frames. To this end, we design the transformer encoder to embed the global context as one additional visual token and further propose a novel Global-Local Correlation (GLC) module to explicitly model the correlation of the global token and each local token. We validate our model on two egocentric video datasets, EGTEA Gaze+ and Ego4D. Our detailed ablation studies demonstrate the benefits of our method. In addition, our approach exceeds the previous state of the art by a large margin. We also provide additional visualizations to support our claim that global-local correlation serves as a key representation for predicting gaze fixation from egocentric videos. More details can be found on our website (https://bolinlai.github.io/GLC-EgoGazeEst).
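
To make the idea of a global token and a global-local correlation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The average pooling used to build the global token, the scaled dot-product score, the softmax over local tokens, and the function names `global_token`, `global_local_correlation`, and `fuse` are all assumptions chosen for illustration; the abstract does not specify them.

```python
import math


def global_token(local_tokens):
    """Build one global token by average-pooling the local tokens.

    Assumption: the summary does not say how the global token is embedded;
    mean pooling is used here only as a simple stand-in.
    """
    n = len(local_tokens)
    dim = len(local_tokens[0])
    return [sum(tok[d] for tok in local_tokens) / n for d in range(dim)]


def global_local_correlation(local_tokens):
    """Return one weight per local token: softmax of its scaled dot product
    with the global token (a hypothetical scoring choice)."""
    g = global_token(local_tokens)
    scale = math.sqrt(len(g))
    scores = [sum(a * b for a, b in zip(g, tok)) / scale for tok in local_tokens]
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(local_tokens, weights):
    """Concatenate each local token with the weight-scaled global token
    along the channel dimension (an illustrative fusion, not the paper's)."""
    g = global_token(local_tokens)
    return [tok + [w * x for x in g] for tok, w in zip(local_tokens, weights)]
```

Under these assumptions, a local token that aligns more strongly with the pooled scene context receives a larger weight, and `fuse` doubles the channel width of each token by appending the weighted global context.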

Paper Structure (18 sections, 3 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Example of local correlation and global-local correlation for the task of egocentric gaze estimation (predicting where the camera wearer is looking using egocentric video alone). The red dot represents the gaze ground truth (from a wearable eye tracker), and the image patch that contains the gaze target has red edges. Global-local correlation models the connections between the global context and each local patch, making it possible to capture, e.g., that the camera wearer and social partner are pointing at the salient object. In contrast, local-local correlations may not yield an effective representation of the scene context.
  • Figure 2: Architecture of the proposed model. The model consists of four modules -- (a) Visual Token Embedding Module encodes the input into local tokens and one global token, (b) Transformer Encoder is composed of multiple regular self-attention and linear layers, (c) Global-Local Correlation Module models the correlation of global and local tokens, and (d) Transformer Decoder maps encoded video features from Transformer Encoder and GLC to gaze prediction. $\oplus$ denotes concatenation along the channel dimension.
  • Figure 3: Visualization of gaze estimation. The first sample is from EGTEA Gaze+ and the second is from Ego4D. Estimated gaze is represented as a heatmap overlaid on the input frames. Green dots denote the ground truth gaze location.
  • Figure 4: Visualization of the eight heads in global-local correlation module. The first sample is from EGTEA Gaze+ and the second is from Ego4D. Green dots denote gaze location.
  • Figure 5: Visualization of the eight heads in global-local correlation module for action recognition.
  • ...and 3 more figures