Upper-Body Pose-based Gaze Estimation for Privacy-Preserving 3D Gaze Target Detection

Andrea Toaiari, Vittorio Murino, Marco Cristani, Cigdem Beyan

TL;DR

This paper uses the person's upper-body pose and available depth maps to extract a 3D gaze direction, then employs a multi-stage or end-to-end pipeline to predict the gazed target without requiring images of the person's face, thus promoting privacy preservation in various application contexts.

Abstract

Gaze Target Detection (GTD), i.e., determining where a person is looking within a scene from an external viewpoint, is a challenging task, particularly in 3D space. Existing approaches heavily rely on analyzing the person's appearance, primarily focusing on their face to predict the gaze target. This paper presents a novel approach to tackle this problem by utilizing the person's upper-body pose and available depth maps to extract a 3D gaze direction and employing a multi-stage or an end-to-end pipeline to predict the gazed target. When predicted accurately, the human body pose can provide valuable information about the head pose, which is a good approximation of the gaze direction, as well as the position of the arms and hands, which are linked to the activity the person is performing and the objects they are likely focusing on. Consequently, in addition to performing gaze estimation in 3D, we are also able to perform GTD simultaneously. We demonstrate state-of-the-art results on the most comprehensive publicly accessible 3D gaze target detection dataset without requiring images of the person's face, thus promoting privacy preservation in various application contexts. The code is available at https://github.com/intelligolabs/privacy-gtd-3D.

Paper Structure (19 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: We introduce a novel way to tackle the 3D gaze estimation and 3D gaze target detection problems that, unlike previous approaches, does not use head crops of the observed people but instead employs upper-body skeletons, depth maps, and scene images in which the head is blurred out. The new pipeline not only proves successful in tackling both tasks but also offers a concrete way to preserve the identity of the people involved. The currently available state-of-the-art method already uses depth maps (in magenta) in its pipeline, so we are not introducing additional information.
  • Figure 2: Overview of the proposed approach. a) The 3D Gaze Estimation Module predicts a gaze vector from the upper-body pose coordinates, processed by a simple MLP, and from the convolutional features extracted from the depth map by a ResNet50 [he2016deep]. A single multi-head attention layer is applied to the concatenated features before the final MLP. b) The perception module unprojects the depth map into a point cloud, subtracts the 3D eye coordinates from it, and computes two heatmaps highlighting the most interesting parts of the scene. c) The 2D gaze target heatmap is predicted by an encoder-decoder architecture operating on the concatenation of the scene image (with the face blurred out), the head location mask, and the two heatmaps. d) During evaluation, the vector in the unprojected and translated point cloud that is most similar to the predicted vector $\hat{g}$ is selected as the final gaze vector, and the corresponding point in the original point cloud is taken as the predicted 3D gaze target.
  • Figure 3: Qualitative results of our method on the GFIE dataset [hu2023gfie]. Each row represents a single sample. First column: the resized depth maps used as input to our gaze estimation module. Second column: the estimated upper-body skeletons, visualized on a black background before normalization. Third column: scene images in which the faces are blurred out before being passed to the encoder-decoder module that predicts the 2D gaze target heatmap. Fourth column: the final point cloud of the scene, with the ground-truth gaze vector in red and the estimated gaze vector in blue.
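
Steps b) and d) of Figure 2 can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a depth map in metric units, and cosine similarity as the measure of "most similar", none of which the caption specifies. The function names `unproject_depth` and `select_gaze_target` are hypothetical.

```python
import numpy as np


def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map of shape (H, W) into camera-space points (H, W, 3).

    Assumes a pinhole camera model; the intrinsics are illustrative parameters.
    """
    h, w = depth.shape
    # u holds the column index and v the row index of every pixel.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def select_gaze_target(points, eye, gaze_hat, eps=1e-8):
    """Pick the scene point whose direction from the eye best matches gaze_hat.

    points:   (H, W, 3) unprojected point cloud
    eye:      (3,) 3D eye position in the same camera frame
    gaze_hat: (3,) predicted gaze direction (need not be unit length)
    Returns the 3D target, the unit gaze vector towards it, and the
    per-pixel cosine-similarity map of shape (H, W).
    """
    h, w, _ = points.shape
    flat = points.reshape(-1, 3)
    rays = flat - eye                      # point cloud translated to the eye
    norms = np.linalg.norm(rays, axis=1)
    # Ignore pixels with missing depth and the degenerate zero-length ray.
    valid = (flat[:, 2] > 0) & (norms > eps)
    g = gaze_hat / np.linalg.norm(gaze_hat)
    cos = np.full(flat.shape[0], -1.0)
    cos[valid] = (rays[valid] @ g) / norms[valid]
    idx = int(np.argmax(cos))
    return flat[idx], rays[idx] / norms[idx], cos.reshape(h, w)


if __name__ == "__main__":
    depth = np.full((5, 5), 2.0)                       # toy flat wall 2 m away
    cloud = unproject_depth(depth, fx=1.0, fy=1.0, cx=2.0, cy=2.0)
    target, gaze, sim = select_gaze_target(
        cloud, eye=np.zeros(3), gaze_hat=np.array([2.0, -2.0, 2.0])
    )
    print(target, gaze, sim.shape)
```

The similarity map returned here is only one plausible starting point for the heatmaps mentioned in step b); the caption does not say how the two heatmaps are actually computed.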