A Novel Framework for Multi-Person Temporal Gaze Following and Social Gaze Prediction

Anshul Gupta, Samy Tafasca, Arya Farkhondeh, Pierre Vuillecard, Jean-Marc Odobez

TL;DR

This paper introduces a framework that jointly predicts the gaze target and social gaze label for every person in a scene. A model trained on the new VSGaze dataset addresses all tasks jointly and achieves state-of-the-art results for multi-person gaze following and social gaze prediction.

Abstract

Gaze following and social gaze prediction are fundamental tasks providing insights into human communication behaviors, intent, and social interactions. Most previous approaches addressed these tasks separately, either by designing highly specialized social gaze models that do not generalize to other social gaze tasks or by considering social gaze inference as an ad-hoc post-processing of the gaze following task. Furthermore, the vast majority of gaze following approaches have proposed static models that can handle only one person at a time, therefore failing to take advantage of social interactions and temporal dynamics. In this paper, we address these limitations and introduce a novel framework to jointly predict the gaze target and social gaze label for all people in the scene. The framework comprises: (i) a temporal, transformer-based architecture that, in addition to image tokens, handles person-specific tokens capturing the gaze information related to each individual; (ii) a new dataset, VSGaze, that unifies annotation types across multiple gaze following and social gaze datasets. We show that our model trained on VSGaze can address all tasks jointly and achieves state-of-the-art results for multi-person gaze following and social gaze prediction.

Paper Structure

This paper contains 28 sections, 10 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: Results of our proposed multi-person and temporal transformer architecture for joint gaze following and social gaze prediction. The social gaze tasks are Looking at Humans (LAH), Looking at Each Other (LAEO), and Shared Attention (SA). For each person, the social gaze task is listed with the associated person ID (e.g., in frame 1, person 2 is in SA with person 4). More qualitative results can be found in the supplementary material.
  • Figure 2: Proposed architecture for multi-person temporal gaze following and social gaze prediction. First, the Person Module (left) processes the set of head crops and bounding boxes to extract a sequence of person tokens for each person. In parallel, the ViT tokenizer processes the sequence of frames to extract frame tokens. Next, the Interaction Module (middle) jointly processes the person and frame tokens, iteratively updating them through people-scene interactions and spatio-temporal social interactions. Finally, the Prediction Module (right) processes the resulting frame and person tokens to infer a sequence of gaze heatmaps and in-out gaze labels for each person, as well as pair-wise social gaze labels for LAH, LAEO, and SA.
  • Figure 3: An illustration of the few cases where the predicted gaze point does not match the predicted LAH label. The uncertainty in the gaze target is reflected in the heatmap, while the uncertainty in the LAH target is reflected in the LAH scores.
  • Figure 4: Annotation statistics and samples for ChildPlay-audio.
  • Figure 5: The standard DPT (a, taken from Ranftl et al., 2021) and our proposed person-conditioned re-assemble stage (b). This transformed DPT is used for predicting gaze heatmaps for each person in the scene.
  • ...and 1 more figure
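
To make the three social gaze labels in the captions above concrete, the following is a minimal sketch of the kind of geometric post-processing of gaze points that the abstract describes as the ad-hoc alternative to joint prediction. It is not the paper's model. The function names, the normalized coordinate convention, and the `sa_radius` threshold are all assumptions made for illustration.

```python
from math import hypot
from typing import List, Optional, Set, Tuple

Point = Tuple[float, float]              # (x, y), assumed normalized to [0, 1]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def looked_at_person(gaze: Optional[Point], heads: List[Box], self_idx: int) -> Optional[int]:
    """LAH: index of the other person whose head box contains the gaze point, else None."""
    if gaze is None:  # hypothetical convention: None means the target is out of frame
        return None
    x, y = gaze
    for j, (x0, y0, x1, y1) in enumerate(heads):
        if j != self_idx and x0 <= x <= x1 and y0 <= y <= y1:
            return j
    return None


def social_gaze(gazes: List[Optional[Point]], heads: List[Box], sa_radius: float = 0.05):
    """Derive LAH targets plus LAEO and SA pairs from per-person gaze points."""
    n = len(heads)
    lah = [looked_at_person(gazes[i], heads, i) for i in range(n)]
    laeo: Set[Tuple[int, int]] = set()
    sa: Set[Tuple[int, int]] = set()
    for i in range(n):
        for j in range(i + 1, n):
            if lah[i] == j and lah[j] == i:
                laeo.add((i, j))  # each person looks at the other's head
            elif gazes[i] is not None and gazes[j] is not None:
                # SA (assumed rule): the two gaze points fall within sa_radius of each other
                if hypot(gazes[i][0] - gazes[j][0], gazes[i][1] - gazes[j][1]) <= sa_radius:
                    sa.add((i, j))
    return lah, laeo, sa


# Toy scene: persons 0 and 1 look at each other; persons 2 and 3 look at the same spot.
heads = [(0.1, 0.1, 0.2, 0.2), (0.8, 0.1, 0.9, 0.2), (0.4, 0.6, 0.5, 0.7), (0.6, 0.6, 0.7, 0.7)]
gazes = [(0.85, 0.15), (0.15, 0.15), (0.5, 0.9), (0.52, 0.91)]
lah, laeo, sa = social_gaze(gazes, heads)
print(lah, laeo, sa)
```

A hard rule like this turns any small error in a gaze point directly into a wrong label, which is one way a predicted gaze point and a separately predicted LAH label can disagree, as the Figure 3 caption notes.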