Enhancing 3D Gaze Estimation in the Wild using Weak Supervision with Gaze Following Labels

Pierre Vuillecard, Jean-Marc Odobez

TL;DR

The paper addresses robust 3D gaze estimation in unconstrained real-world settings, where in-the-wild 3D gaze data are limited. It introduces ST-WSGE, a two-stage self-training framework that exploits 2D gaze-following data to create 3D pseudo-labels, and GaT, a modality-agnostic Gaze Transformer that processes both image and video inputs. Stage 1 trains on existing 3D gaze datasets. Stage 2 infers 3D gaze on 2D gaze-following data and applies a geometric alignment to generate pseudo-labels for retraining, with a temporal angular loss guiding optimization. The approach achieves state-of-the-art or competitive generalization on Gaze360 and GFIE, shows strong cross-modal gains in video gaze estimation, and improves cross-domain performance on MPIIFaceGaze. The authors state that code and models will be released.

Abstract

Accurate 3D gaze estimation in unconstrained real-world environments remains a significant challenge due to variations in appearance, head pose, occlusion, and the limited availability of in-the-wild 3D gaze datasets. To address these challenges, we introduce a novel Self-Training Weakly-Supervised Gaze Estimation framework (ST-WSGE). This two-stage learning framework leverages diverse 2D gaze datasets, such as gaze-following data, which offer rich variations in appearances, natural scenes, and gaze distributions, and proposes an approach to generate 3D pseudo-labels and enhance model generalization. Furthermore, traditional modality-specific models, designed separately for images or videos, limit the effective use of available training data. To overcome this, we propose the Gaze Transformer (GaT), a modality-agnostic architecture capable of simultaneously learning static and dynamic gaze information from both image and video datasets. By combining 3D video datasets with 2D gaze target labels from gaze following tasks, our approach achieves the following key contributions: (i) Significant state-of-the-art improvements in within-domain and cross-domain generalization on unconstrained benchmarks like Gaze360 and GFIE, with notable cross-modal gains in video gaze estimation; (ii) Superior cross-domain performance on datasets such as MPIIFaceGaze and Gaze360 compared to frontal face methods. Code and pre-trained models will be released to the community.

Paper Structure

This paper contains 22 sections, 2 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Significance of ST-WSGE. Our self-training based weakly-supervised framework for robust 3D gaze estimation in real-world conditions (e.g., varying appearance, extreme poses, resolution, and occlusion). All predictions used our image and video agnostic Gaze Transformer (GaT) model. Top row: importance of the training diversity using ST-WSGE and GazeFollow (GF) for generalization compared to standard supervised methods. Bottom row: influence of temporal context between image and video inference. Circles in images represent unit disks where 3D gaze vectors are projected onto the image plane (x, y in yellow) and a top-down view (x, z in blue). Images from VideoAttentionTarget, GFIE, and MPIIFaceGaze datasets.
  • Figure 2: Our ST-WSGE training framework. 1. In the first stage, we train a Gaze Transformer (GaT) on both image and video 3D gaze datasets. 2. Using the trained network, 3D gaze is inferred on a 2D gaze dataset. Then, a geometric rotation is applied to the inferred 3D gaze to generate a pseudo 3D gaze label that is aligned with the 2D ground-truth gaze label in the image plane. 3. In the second stage, we train a gaze network similar to the one in step 1, using the available 3D gaze datasets together with gaze-following datasets carrying 3D pseudo-labels.
  • Figure 3: Dataset gaze distribution. Gaze in polar coordinates.
  • Figure S1: Input head crop using different scales. In our work, a scale of -0.1 is used and proved to be effective in both constrained and frontal-face settings (see the section on crop size).
  • Figure S2: Effect of head crop size.
  • ...and 4 more figures
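
The pseudo-label step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the gaze vector is expressed in a camera frame whose x and y axes span the image plane and whose z axis is the optical axis, and it interprets the "geometric rotation" as a rotation about z that keeps the predicted depth component unchanged. The function name `align_pseudo_label` and its arguments are hypothetical.

```python
import math


def align_pseudo_label(gaze_3d, target_2d_dir, eps=1e-8):
    """Rotate a predicted 3D gaze vector about the z axis so that its
    image-plane (x, y) projection points along the 2D ground-truth direction.

    gaze_3d       -- predicted gaze (x, y, z); normalised to unit length here
    target_2d_dir -- 2D direction from the head to the annotated gaze target
                     (assumed input format; any non-zero length is accepted)
    """
    norm = math.sqrt(sum(c * c for c in gaze_3d))
    if norm < eps:
        raise ValueError("gaze_3d must be a non-zero vector")
    x, y, z = (c / norm for c in gaze_3d)

    dx, dy = target_2d_dir
    d_norm = math.hypot(dx, dy)
    if d_norm < eps:
        # No usable 2D direction: fall back to the unmodified prediction.
        return (x, y, z)

    # Length of the in-plane component; a rotation about z preserves it.
    r = math.hypot(x, y)

    # Point the in-plane component along the 2D label and keep z as predicted.
    return (r * dx / d_norm, r * dy / d_norm, z)


# Example: the prediction points right and away from the camera, while the
# 2D label says the target lies along +y in the image.
pseudo = align_pseudo_label((0.6, 0.0, -0.8), (0.0, 2.0))
print(pseudo)
```

Under these assumptions the 2D label fixes only the in-plane direction, while the depth component (z) is inherited from the Stage 1 prediction, so the result stays a unit vector. A prediction with no in-plane component (pointing straight along z) is returned unchanged.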