DMAGaze: Gaze Estimation Based on Feature Disentanglement and Multi-Scale Attention

Haohan Chen, Hongjia Liu, Shiyong Lan, Wenwu Wang, Yixin Qiao, Yao Li, Guonan Deng

TL;DR

DMAGaze addresses interference from gaze-irrelevant facial information by disentangling global gaze-relevant features from gaze-irrelevant content and fusing them with local eye features and head pose. It introduces a Disentangler based on continuous masks and a Multi-Scale Global-Local Attention Module (MS-GLAM), which incorporates a Gaussian Modulated Weighting non-local block (GMW-Non-Local) to capture nonlinear global dependencies. The method reports state-of-the-art angular errors of $3.74^\circ$ on MPIIFaceGaze and $6.17^\circ$ on Rt-Gene, and ablation studies support the contributions of disentanglement, head pose integration, and multi-scale attention. These gains are relevant to human-computer interaction and related applications.
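
The angular errors quoted above are the angle between the predicted and ground-truth 3D gaze directions. A minimal sketch of that metric follows, assuming gaze is given as 3D vectors; the function names `angular_error_deg` and `mean_angular_error_deg` are hypothetical and are not taken from the paper's code.

```python
import math

def angular_error_deg(pred, true):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(p * t for p, t in zip(pred, true))
    norm_pred = math.sqrt(sum(p * p for p in pred))
    norm_true = math.sqrt(sum(t * t for t in true))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_pred * norm_true)))
    return math.degrees(math.acos(cosine))

def mean_angular_error_deg(preds, trues):
    """Average angular error over paired predictions and ground truths."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, trues)]
    return sum(errors) / len(errors)

# Illustrative vectors only, not data from the paper.
preds = [(0.0, 0.0, -1.0), (1.0, 0.0, 0.0)]
trues = [(0.0, 0.0, -1.0), (0.0, 1.0, 0.0)]
mean_error = mean_angular_error_deg(preds, trues)
```

Datasets that store gaze as pitch and yaw angles need a conversion to unit vectors first; the sign convention for that conversion varies between codebases, so it is left out here.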

Abstract

Gaze estimation, which predicts gaze direction, commonly faces the challenge of interference from complex gaze-irrelevant information in face images. In this work, we propose DMAGaze, a novel gaze estimation framework that exploits three kinds of information from facial images to improve overall performance: gaze-relevant global features (disentangled from the facial image), local eye features (extracted from the cropped eye patch), and head pose estimation features. First, we design a new continuous mask-based Disentangler that separates gaze-relevant from gaze-irrelevant information in facial images; it achieves this dual-branch disentanglement by reconstructing the eye and non-eye regions separately. Second, we introduce a new cascaded attention module named Multi-Scale Global Local Attention Module (MS-GLAM). Through a customized cascaded attention structure, it focuses on global and local information at multiple scales, further enhancing the information from the Disentangler. Finally, the global gaze-relevant features disentangled by the upper face branch, combined with head pose and local eye features, are passed through the detection head for high-precision gaze estimation. DMAGaze has been extensively validated on two mainstream public datasets, achieving state-of-the-art performance.
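
The abstract does not spell out how the continuous mask is applied or how the three feature sources are combined. As an illustration only, the sketch below assumes a soft mask in $[0, 1]$ that splits a feature vector into complementary gaze-relevant and gaze-irrelevant parts, and assumes fusion by plain concatenation. All names (`split_by_continuous_mask`, `fuse`) and values are hypothetical; the actual Disentangler is a learned network trained with reconstruction objectives, which is not reproduced here.

```python
import math

def sigmoid(x):
    """Map a real-valued logit to a continuous mask value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def split_by_continuous_mask(features, mask_logits):
    """Split features into two complementary parts with a soft mask.

    Assumption: relevant = m * f and irrelevant = (1 - m) * f, so the two
    parts always sum back to the original features.
    """
    mask = [sigmoid(z) for z in mask_logits]
    relevant = [m * f for m, f in zip(mask, features)]
    irrelevant = [(1.0 - m) * f for m, f in zip(mask, features)]
    return relevant, irrelevant

def fuse(global_relevant, eye_features, head_pose):
    """Assumed fusion: concatenate the three feature sources."""
    return list(global_relevant) + list(eye_features) + list(head_pose)

# Toy values standing in for learned features and mask logits.
features = [0.8, -1.2, 0.5, 2.0]
mask_logits = [4.0, -4.0, 0.0, 1.5]
relevant, irrelevant = split_by_continuous_mask(features, mask_logits)
fused = fuse(relevant, [0.1, 0.2], [0.05, -0.3])
```

Unlike a hard binary mask, a continuous mask lets each feature contribute partially to both branches, which keeps the split differentiable during training.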

Paper Structure

This paper contains 19 sections, 28 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: This figure illustrates the comparison between traditional gaze estimation methods and our proposed method. The gray pathway represents the basic framework of many previous methods, while the black pathway represents our proposed framework for disentangling gaze-relevant and -irrelevant facial information to further explore the complex gaze relationship between eyes and face.
  • Figure 2: The overall architecture of our proposed DMAGaze. The top illustrates the overall workflow of our model. The bottom left illustrates the propagation of the loss functions. The bottom right illustrates the GMW-Non-Local module.
  • Figure 3: The data distributions of the MPIIFaceGaze and Rt-Gene datasets.
  • Figure 4: Visualization of the attention maps of the upper face branch and lower face branch of the proposed DMAGaze. (a) Input images from the MPIIFaceGaze dataset. (b) Attention maps from the upper face branch after the Disentangler, which is mainly responsible for reconstructing the eye region. (c) Attention maps from the lower face branch after the Disentangler, which is dedicated to reconstructing non-eye regions.
  • Figure 5: Visualization of gaze estimation results for different component configurations of our gaze estimation model in the ablation studies, covering scenarios such as wearing glasses, variable lighting, and varied head poses. The green line represents the ground truth gaze and the red line represents the estimated gaze.