DMAGaze: Gaze Estimation Based on Feature Disentanglement and Multi-Scale Attention
Haohan Chen, Hongjia Liu, Shiyong Lan, Wenwu Wang, Yixin Qiao, Yao Li, Guonan Deng
TL;DR
DMAGaze addresses gaze estimation under gaze-irrelevant facial noise by disentangling global gaze-relevant features from gaze-irrelevant content and fusing them with local eye features and head pose. It introduces a Disentangler based on continuous masks and a Multi-Scale Global-Local Attention Module (MS-GLAM), which incorporates Gaussian Modulated Weighting (GMW-Non-Local) to capture nonlinear global dependencies. The method reports state-of-the-art angular errors of $3.74^\circ$ on MPIIFaceGaze and $6.17^\circ$ on Rt-Gene, and ablation studies support the individual contributions of disentanglement, head pose integration, and multi-scale attention. These gains are relevant to human-computer interaction and related applications.
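The angular errors quoted above are angles between a predicted and a ground-truth 3D gaze direction. The following is a minimal sketch of that metric, not the authors' evaluation code. The function names are hypothetical, and the (pitch, yaw) to vector convention is an assumption (one commonly used with MPIIGaze-style labels); other codebases may use different signs or axis orders.

```python
import math


def gaze_to_vector(pitch, yaw):
    """Convert (pitch, yaw) in radians to a unit 3D gaze vector.

    Assumed convention (not taken from the paper): the camera looks
    along -z, so pitch = yaw = 0 maps to (0, 0, -1).
    """
    return (
        -math.cos(pitch) * math.sin(yaw),
        -math.sin(pitch),
        -math.cos(pitch) * math.cos(yaw),
    )


def angular_error_deg(pred, target):
    """Return the angle in degrees between two 3D gaze vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_sim = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_sim))


# Usage: a prediction that is off by 0.1 rad in yaw only.
error = angular_error_deg(gaze_to_vector(0.0, 0.1), gaze_to_vector(0.0, 0.0))
```

A dataset-level score such as $3.74^\circ$ would typically be the mean of this per-sample error over the test set; the exact averaging protocol is not stated in this summary.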
Abstract
Gaze estimation, which predicts gaze direction, commonly suffers from interference by complex gaze-irrelevant information in face images. In this work, we propose DMAGaze, a novel gaze estimation framework that draws on three sources of information in a facial image to improve overall performance: gaze-relevant global features (disentangled from the facial image), local eye features (extracted from a cropped eye patch), and head pose estimation features. First, we design a new continuous mask-based Disentangler that separates gaze-relevant from gaze-irrelevant information in facial images; it achieves this dual-branch disentanglement by reconstructing the eye and non-eye regions separately. Second, we introduce a new cascaded attention module, the Multi-Scale Global-Local Attention Module (MS-GLAM). Through a customized cascaded attention structure, it attends to global and local information at multiple scales, further enhancing the features produced by the Disentangler. Finally, the global gaze-relevant features disentangled by the upper face branch are combined with head pose and local eye features and passed through the detection head for high-precision gaze estimation. DMAGaze has been extensively validated on two mainstream public datasets, achieving state-of-the-art performance.
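To make the idea of a continuous mask concrete, here is a toy sketch of splitting a feature vector into two complementary parts with a soft mask and then concatenating three feature sources. It is an illustration under stated assumptions, not the paper's Disentangler: the complementary `1 - mask` form, the sigmoid, concatenation as the fusion step, and all function names are hypothetical, and the actual model learns its masks and trains with reconstruction objectives.

```python
import math


def sigmoid(x):
    """Map a real-valued logit to a soft mask value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def split_with_continuous_mask(features, mask_logits):
    """Split features into two complementary parts using a soft mask.

    Assumption: the gaze-irrelevant part uses the complement (1 - mask),
    so the two parts always sum back to the original features.
    """
    mask = [sigmoid(z) for z in mask_logits]
    relevant = [m * f for m, f in zip(mask, features)]
    irrelevant = [(1.0 - m) * f for m, f in zip(mask, features)]
    return relevant, irrelevant


def fuse(global_feat, eye_feat, head_pose):
    """Concatenate the three feature sources into one vector (assumed fusion)."""
    return list(global_feat) + list(eye_feat) + list(head_pose)


# Usage with made-up numbers: 4 global features, 2 eye features, 2 pose values.
features = [0.5, -1.0, 2.0, 0.25]
mask_logits = [4.0, -4.0, 0.0, 1.0]  # would be learned in a real model
relevant, irrelevant = split_with_continuous_mask(features, mask_logits)
fused = fuse(relevant, eye_feat=[0.1, 0.2], head_pose=[0.05, -0.3])
```

Unlike a binary mask, each feature here contributes fractionally to both branches, which keeps the split differentiable.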
