ViTGaze: Gaze Following with Interaction Features in Vision Transformers

Yuehao Song, Xinggang Wang, Jingfeng Yao, Wenyu Liu, Jinglin Zhang, Xiangmin Xu

TL;DR

This work finds that a ViT with self-supervised pre-training has an enhanced ability to extract correlation information. The resulting method, ViTGaze, achieves state-of-the-art performance among single-modality methods and performance comparable to multi-modality methods while using 59% fewer parameters.

Abstract

Gaze following aims to interpret human-scene interactions by predicting the person's focal point of gaze. Prevailing approaches often adopt a two-stage framework, in which multi-modality information is extracted in the initial stage for gaze target prediction. Consequently, the efficacy of these methods depends heavily on the precision of the preceding modality extraction. Others use a single-modality approach with complex decoders, which increases the computational load of the network. Inspired by the remarkable success of pre-trained plain vision transformers (ViTs), we introduce a novel single-modality gaze following framework called ViTGaze. In contrast to previous methods, ViTGaze is based mainly on powerful encoders (the decoders account for less than 1% of the parameters). Our principal insight is that the inter-token interactions within self-attention can be transferred to interactions between humans and scenes. Building on this premise, we formulate a framework consisting of a 4D interaction encoder and a 2D spatial guidance module to extract human-scene interaction information from self-attention maps. Furthermore, our investigation reveals that ViT with self-supervised pre-training has an enhanced ability to extract correlation information. Extensive experiments demonstrate the performance of the proposed method. Our method achieves state-of-the-art (SOTA) performance among all single-modality methods (3.4% improvement in the area under curve (AUC) score, 5.1% improvement in the average precision (AP)) and performance comparable to multi-modality methods with 59% fewer parameters.
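
The principal insight above, that inter-token attention can be read as human-scene interaction, can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a single attention head, random queries and keys as stand-ins for real ViT activations, and a hypothetical head-patch position. All function and variable names are illustrative assumptions. The sketch computes softmax(QK^T / sqrt(d)) and reshapes the attention row of the head-patch token into a grid over the scene.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(queries, keys):
    """Return the N x N self-attention matrix softmax(Q K^T / sqrt(d))."""
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    return [
        softmax([scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys])
        for q in queries
    ]


def head_to_scene_map(attn, head_token, h, w):
    """Reshape the attention row of one head-patch token into an h x w grid."""
    row = attn[head_token]
    return [row[r * w:(r + 1) * w] for r in range(h)]


# Hypothetical toy setup: a 3 x 4 grid of patch tokens with 8-dim queries/keys.
random.seed(0)
h, w, d = 3, 4, 8
n_tokens = h * w
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]

attn = attention_map(Q, K)
# Assumption: the person's head falls in the patch at row 1, column 2.
head_token = 1 * w + 2
interaction = head_to_scene_map(attn, head_token, h, w)
```

Each entry of `interaction` is the attention weight that the head-patch token assigns to one scene patch, so the grid sums to 1. In a trained ViT, large entries would mark regions the head token attends to, which is the signal the abstract proposes to reuse for gaze following.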

Paper Structure (30 sections, 5 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Comparison of area under curve (AUC) scores with state-of-the-art (SOTA) methods. ViTGaze achieves SOTA AUC scores among all methods on both GazeFollow [Recasens_GazeFollow_2015_NIPS] and VideoAttentionTarget [Chong_VideoAttn_2020_CVPR]. The circle size indicates the number of parameters of each method. We compare our results (ViTGaze) with those of Lian et al. [Lian_ACCV_2019], Chong et al. [Chong_VideoAttn_2020_CVPR], DAM [Fang_DAM_2021_CVPR], HGTTR [Tu_HGGTR_2022_CVPR], Tonini et al. [Tonini_GOT_2023_ICCV], GTR [tu2023joint], Miao et al. [Miao_PDP_2023_WACV], and Gupta et al. [Gupta_MM_2022_CVPR].
  • Figure 2: Overall structure of ViTGaze. We achieve high-performance gaze following by predicting interactions with multi-level and multi-head attention maps, which we refer to as 4D features, guided by 2D spatial information. ViTGaze leverages a pre-trained vision transformer and lightweight decoders with fewer than 1M parameters. $\bigotimes$ refers to the weighted sum. $C$, $K$, $H$, $h$, and $w$ refer to the number of feature channels, the number of semantic levels, the number of attention heads, and the height and the width of the input image, respectively. $L_\mathrm{hm}$, $L_\mathrm{io}$, and $L_\mathrm{aux}$ refer to the gaze heatmap loss, the gaze in-out loss, and the auxiliary head regression loss, respectively. Linear proj., Conv2d, BatchNorm, ReLU, Sigmoid, and Upsample refer to linear projection, 2D convolution, batch normalization, ReLU activation, sigmoid activation, and bilinear upsampling, respectively.
  • Figure 3: Visualization of vision transformer (ViT) features. In contrast to the feature map, which reveals global object-level semantics, the attention map of the tokens overlapping the head reflects human-scene interactions. Q, K, V, and N refer to queries, keys, values, and the number of transformer blocks, respectively. $\bigotimes$, $\bigoplus$, and $\sigma$ refer to weighted sum, addition, and softmax, respectively. LayerNorm and FFN refer to layer normalization and the feed-forward network.
  • Figure 4: Comparisons with previous ViT-based tasks. In contrast to previous tasks that use feature maps, ViTGaze leverages interaction information encoded in the attention maps.
  • Figure 5: Visualization of the main components of ViTGaze on the GazeFollow [Recasens_GazeFollow_2015_NIPS] dataset. We observe that the 2D spatial guidance (2nd column) successfully extracts the patches overlapping the head, and the multi-level interaction maps (3rd and 4th columns) reflect the region interactions. Specifically, the 6th layer captures local regions, whereas the 12th layer has a stronger capacity for global modeling.
  • ...and 3 more figures
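
The Figure 2 caption describes multi-level, multi-head attention maps combined with 2D spatial guidance through a weighted sum. The sketch below shows one plausible reading of that step under stated assumptions; it is not the authors' code. The tensor layout `[level][head][query][key]`, the function name `guided_interaction_features`, and the hand-written guidance mask (standing in for the guidance the model would produce) are all hypothetical. For each of the K levels and H heads, the attention rows of the query tokens are averaged with weights taken from the guidance grid, giving K*H maps over the token grid.

```python
def guided_interaction_features(attn, guidance):
    """Weighted sum over query tokens of stacked attention maps.

    attn:     nested list indexed [level][head][query][key], each N x N
    guidance: h x w grid of non-negative weights marking head patches
    returns:  list of K*H maps, each an h x w grid
    """
    h, w = len(guidance), len(guidance[0])
    flat = [g for row in guidance for g in row]
    total = sum(flat)
    if total <= 0:
        raise ValueError("guidance must contain at least one positive weight")
    weights = [g / total for g in flat]  # normalise so the weights sum to 1

    features = []
    for level in attn:
        for head in level:
            # Guidance-weighted average of the attention rows of this head.
            pooled = [
                sum(weights[q] * head[q][k] for q in range(h * w))
                for k in range(h * w)
            ]
            features.append([pooled[r * w:(r + 1) * w] for r in range(h)])
    return features


# Hypothetical toy input: K = 2 levels, H = 3 heads, a 2 x 2 token grid,
# with uniform attention (every row sums to 1).
K_levels, H_heads, h, w = 2, 3, 2, 2
n = h * w
uniform = [[1.0 / n] * n for _ in range(n)]
attn = [[uniform for _ in range(H_heads)] for _ in range(K_levels)]
guidance = [[0.0, 1.0], [0.0, 1.0]]  # head overlaps the right-hand patches

features = guided_interaction_features(attn, guidance)
```

Because the guidance weights are normalised and every attention row sums to 1, each output map also sums to 1. A lightweight head such as the Conv2d, BatchNorm, ReLU stack named in the caption could then consume these K*H maps as input channels.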