Hybrid-Domain Adaptative Representation Learning for Gaze Estimation

Qida Tan, Hongyu Yang, Wenchao Du

TL;DR

This work tackles cross-domain degradation in appearance-based gaze estimation by introducing HARL, a hybrid-domain adaptive representation learning framework. HARL uses high-quality near-eye data as the source domain and low-quality facial images as the target domain, aligning the pseudo-inverse Gram matrix $(Z^TZ)^+$ across domains via angular and scale constraints to disentangle gaze-relevant features with a shared regressor $R$, i.e., minimizing $\mathcal{L}_{UDA}=\mathcal{L}_{Angle} + \mathcal{L}_{Scale}$. It further fuses gaze and head-pose information through a Sparse-Graph Fusion module, where a graph over per-dimension features is refined by top-$k$ neighbor connections to produce a dense gaze representation $Z_G$ for regression. The model is trained end-to-end with a joint loss $\mathcal{L}_{total}=\text{MSE}(\mathcal{G}_{Face}, \hat{\mathcal{G}}_{Face}) + \text{MSE}(\mathcal{G}_{eye}, \hat{\mathcal{G}}_{eye}) + \lambda \mathcal{L}_{UDA}(Z_s,Z_t)$ and demonstrates state-of-the-art accuracy on MPIIFaceGaze ($3.36^{\circ}$), EyeDiap ($5.02^{\circ}$), and Gaze360 ($9.26^{\circ}$), with strong cross-domain performance and minimal inference overhead. By leveraging hybrid-domain supervision and geometry-aware fusion, HARL provides a practical, scalable solution for robust gaze estimation in unconstrained settings.
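As a rough illustration of the alignment objective, the NumPy sketch below computes a loss of the form $\mathcal{L}_{Angle} + \mathcal{L}_{Scale}$ on the pseudo-inverse Gram matrices of source and target features, and plugs it into the joint loss. The summary does not give the exact form of either term, so both are assumptions here: the angle term is taken as one minus the cosine similarity between matching columns of $(Z_s^TZ_s)^+$ and $(Z_t^TZ_t)^+$, and the scale term as the mean absolute gap between the singular values of $Z_s$ and $Z_t$. The function names and the default `lam` are hypothetical.

```python
import numpy as np

def pinv_gram(Z):
    """Pseudo-inverse (Z^T Z)^+ of the Gram matrix of a (batch, dim) feature matrix."""
    return np.linalg.pinv(Z.T @ Z, hermitian=True)

def uda_loss(Z_s, Z_t):
    """Assumed form of L_UDA = L_Angle + L_Scale; Z_s and Z_t must share a shape."""
    G_s, G_t = pinv_gram(Z_s), pinv_gram(Z_t)
    # Angle term (assumption): 1 - cosine similarity of matching columns.
    num = np.sum(G_s * G_t, axis=0)
    den = np.linalg.norm(G_s, axis=0) * np.linalg.norm(G_t, axis=0) + 1e-12
    l_angle = np.mean(1.0 - num / den)
    # Scale term (assumption): gap between the singular values of the features.
    s_s = np.linalg.svd(Z_s, compute_uv=False)
    s_t = np.linalg.svd(Z_t, compute_uv=False)
    l_scale = np.mean(np.abs(s_s - s_t))
    return l_angle + l_scale

def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return np.mean((a - b) ** 2)

def total_loss(g_face, g_face_hat, g_eye, g_eye_hat, Z_s, Z_t, lam=0.1):
    """Joint loss: two supervised MSE terms plus the weighted alignment term."""
    return mse(g_face, g_face_hat) + mse(g_eye, g_eye_hat) + lam * uda_loss(Z_s, Z_t)
```

In this sketch the alignment term vanishes when source and target features coincide, so only the two supervised MSE terms remain; in training, `Z_s` and `Z_t` would be the features the shared regressor $R$ consumes.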

Abstract

Appearance-based gaze estimation, which aims to predict an accurate 3D gaze direction from a single facial image, has made promising progress in recent years. However, most methods suffer significant performance degradation in cross-domain evaluation due to interference from gaze-irrelevant factors, such as expressions, wearables, and image quality. To alleviate this problem, we present a novel Hybrid-domain Adaptative Representation Learning (HARL) framework that exploits multi-source hybrid datasets to learn a robust gaze representation. More specifically, we propose to disentangle the gaze-relevant representation from low-quality facial images by aligning it with features extracted from high-quality near-eye images in an unsupervised domain-adaptation manner, which adds almost no computational or inference cost. Additionally, we analyze the effect of head pose and design a simple yet efficient sparse graph fusion module to explore the geometric constraint between gaze direction and head pose, leading to a dense and robust gaze representation. Extensive experiments on the EyeDiap, MPIIFaceGaze, and Gaze360 datasets demonstrate that our approach achieves state-of-the-art accuracy of $\textbf{5.02}^{\circ}$, $\textbf{3.36}^{\circ}$, and $\textbf{9.26}^{\circ}$, respectively, and delivers competitive performance in cross-dataset evaluation. The code is available at https://github.com/da60266/HARL.
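The sparse graph fusion idea can be sketched in a few lines of NumPy. This is only one plausible reading, not the released implementation: it assumes each node carries a feature vector (here, gaze and head-pose node features stacked into one matrix `X`), measures node similarity with cosine similarity, keeps the `k` most similar neighbors per node, and applies a single row-normalized propagation step. The function names, the similarity measure, and the self-loops are all assumptions.

```python
import numpy as np

def topk_adjacency(X, k):
    """Row-normalised adjacency keeping each node's k most similar neighbours."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    S = Xn @ Xn.T                       # cosine similarity between nodes
    np.fill_diagonal(S, -np.inf)        # exclude self when ranking neighbours
    # Indices of the k largest similarities in every row (requires k < n_nodes).
    idx = np.argpartition(-S, k - 1, axis=1)[:, :k]
    A = np.zeros(S.shape)
    rows = np.arange(S.shape[0])[:, None]
    A[rows, idx] = 1.0                  # sparse top-k connections
    A += np.eye(S.shape[0])             # self-loops keep each node's own feature
    return A / A.sum(axis=1, keepdims=True)

def sparse_graph_fusion(X, k=2):
    """One propagation step over the top-k graph: each node averages its neighbours."""
    return topk_adjacency(X, k) @ X

# Hypothetical usage: 4 gaze nodes and 4 head-pose nodes with 16-dim features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(4, 16)), rng.normal(size=(4, 16))])
Z_G = sparse_graph_fusion(X, k=2)       # fused representation, same shape as X
```

Restricting each node to its top-`k` neighbors keeps the graph sparse, so propagation mixes only the most related gaze and pose features rather than averaging over everything.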

Paper Structure

This paper contains 21 sections, 14 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) Sample quality comparison between in-the-wild and near-eye datasets. (b) Performance comparison between in-the-wild [cheng2022gaze, cheng2020coarse, kellnhofer2019gaze360] and near-eye [yiu2019deepvog, popovic2023model, feng2022real] datasets.
  • Figure 2: Overview of our framework. The proposed method uses a dual-branch network to learn monocular and pose representations; a sparse-graph fusion module then generates a dense gaze representation.
  • Figure 3: Visualized results from different datasets. The red and green arrows denote the predicted and ground-truth gaze vectors, respectively.
  • Figure 4: Monocular gaze feature distribution visualization. Different colors denote features from different samples.
  • Figure 5: Visualization of the full-face gaze feature distribution. Different colors denote different gaze directions, and close gaze directions share similar colors.
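
The degree values quoted above and the predicted versus ground-truth arrows in Figure 3 suggest the usual gaze metric: the angle between the two 3D gaze vectors. The sketch below assumes that metric (the summary does not state it explicitly); `angular_error_deg` is a hypothetical helper name.

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Angle in degrees between predicted and ground-truth 3D gaze vectors.

    Both inputs have shape (..., 3) and need not be unit length.
    """
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    # Clip guards against values slightly outside [-1, 1] from rounding.
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

# Hypothetical usage: a dataset-level score is the mean over all samples.
pred = np.array([[0.0, 0.0, -1.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
mean_error = angular_error_deg(pred, gt).mean()
```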