Hybrid-Domain Adaptative Representation Learning for Gaze Estimation
Qida Tan, Hongyu Yang, Wenchao Du
TL;DR
This work tackles cross-domain degradation in appearance-based gaze estimation by introducing HARL, a hybrid-domain adaptive representation learning framework. HARL treats high-quality near-eye data as the source domain and low-quality facial images as the target domain. It aligns the pseudo-inverse Gram matrices $(Z^TZ)^+$ of the two domains' features through angular and scale constraints, so that gaze-relevant features are disentangled and decoded by a shared regressor $R$; the alignment objective is $\mathcal{L}_{UDA}=\mathcal{L}_{Angle} + \mathcal{L}_{Scale}$. HARL further fuses gaze and head-pose information through a Sparse-Graph Fusion module, in which a graph over per-dimension features is pruned to top-$k$ neighbor connections to produce a dense gaze representation $Z_G$ for regression. The model is trained end-to-end with the joint loss $\mathcal{L}_{total}=\text{MSE}(\mathcal{G}_{Face}, \hat{\mathcal{G}}_{Face}) + \text{MSE}(\mathcal{G}_{eye}, \hat{\mathcal{G}}_{eye}) + \lambda \mathcal{L}_{UDA}(Z_s,Z_t)$ and reports state-of-the-art angular errors on MPIIFaceGaze ($3.36^{\circ}$), EyeDiap ($5.02^{\circ}$), and Gaze360 ($9.26^{\circ}$), with strong cross-domain performance and minimal inference overhead. By combining hybrid-domain supervision with geometry-aware fusion, HARL offers a practical, scalable approach to robust gaze estimation in unconstrained settings.
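To make the objective above concrete, the following is a minimal NumPy sketch of how an angle-plus-scale alignment of $(Z^TZ)^+$ and the joint loss could be computed. It is not the authors' implementation. The function names, the column-wise cosine form of the angle term, and the largest-eigenvalue form of the scale term are all assumptions chosen for illustration; the summary only states that the two terms are summed.

```python
import numpy as np

def uda_alignment_loss(z_s, z_t, eps=1e-12):
    """Illustrative L_UDA = L_Angle + L_Scale for features z_s, z_t of shape (n, d)."""
    # Gram matrices Z^T Z and their pseudo-inverses (Z^T Z)^+, each (d, d).
    gram_s, gram_t = z_s.T @ z_s, z_t.T @ z_t
    inv_s, inv_t = np.linalg.pinv(gram_s), np.linalg.pinv(gram_t)

    # Angle term (assumed form): 1 - cosine similarity between matching
    # columns of the two pseudo-inverse Gram matrices, averaged over columns.
    num = np.sum(inv_s * inv_t, axis=0)
    den = np.linalg.norm(inv_s, axis=0) * np.linalg.norm(inv_t, axis=0) + eps
    loss_angle = np.mean(1.0 - num / den)

    # Scale term (assumed form): gap between the largest eigenvalues
    # of the source and target Gram matrices.
    lam_s = np.linalg.eigvalsh(gram_s)[-1]
    lam_t = np.linalg.eigvalsh(gram_t)[-1]
    loss_scale = abs(lam_s - lam_t)

    return loss_angle + loss_scale

def total_loss(g_face, g_face_hat, g_eye, g_eye_hat, z_s, z_t, lam=1.0):
    """Joint loss: two MSE regression terms plus the weighted alignment term."""
    mse = lambda a, b: np.mean((a - b) ** 2)
    return (mse(g_face, g_face_hat) + mse(g_eye, g_eye_hat)
            + lam * uda_alignment_loss(z_s, z_t))
```

With identical source and target features both terms vanish. Rescaling one domain leaves the angle term at zero but raises the scale term, which is why the two constraints are kept separate in this sketch.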
Abstract
Appearance-based gaze estimation, which aims to predict an accurate 3D gaze direction from a single facial image, has made promising progress in recent years. However, most methods suffer significant performance degradation in cross-domain evaluation due to interference from gaze-irrelevant factors such as expressions, wearables, and image quality. To alleviate this problem, we present a novel Hybrid-domain Adaptative Representation Learning (HARL for short) framework that exploits multi-source hybrid datasets to learn a robust gaze representation. More specifically, we propose to disentangle the gaze-relevant representation from low-quality facial images by aligning it with features extracted from high-quality near-eye images in an unsupervised domain-adaptation manner, which adds almost no computational or inference cost. Additionally, we analyze the effect of head pose and design a simple yet efficient sparse graph fusion module to explore the geometric constraint between gaze direction and head pose, leading to a dense and robust gaze representation. Extensive experiments on the EyeDiap, MPIIFaceGaze, and Gaze360 datasets demonstrate that our approach achieves state-of-the-art accuracy of $\textbf{5.02}^{\circ}$, $\textbf{3.36}^{\circ}$, and $\textbf{9.26}^{\circ}$, respectively, and delivers competitive performance in cross-dataset evaluation. The code is available at https://github.com/da60266/HARL.
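The sparse graph fusion idea (a graph over feature nodes pruned to each node's top-$k$ neighbors) can be illustrated with a short NumPy sketch. This is one plausible reading and not the released code: the cosine-similarity affinity, the mean aggregation with a residual connection, and the names `topk_sparse_adjacency` and `fuse` are hypothetical choices made for this example.

```python
import numpy as np

def topk_sparse_adjacency(nodes, k):
    """Keep, for each of the n nodes in `nodes` (n, d), its k most similar neighbors."""
    # Cosine similarity between all pairs of node features (assumed affinity).
    unit = nodes / (np.linalg.norm(nodes, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-loops from the ranking

    # Indices of the k highest-similarity neighbors per row.
    idx = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros((nodes.shape[0], nodes.shape[0]))
    rows = np.arange(nodes.shape[0])[:, None]
    adj[rows, idx] = 1.0
    return adj

def fuse(nodes, adj):
    """Mean-aggregate each node's selected neighbors and add a residual."""
    deg = np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
    return nodes + (adj @ nodes) / deg

# Example: 6 hypothetical nodes (e.g. gaze and head-pose feature dimensions).
rng = np.random.default_rng(0)
nodes = rng.normal(size=(6, 4))
adj = topk_sparse_adjacency(nodes, k=2)
fused = fuse(nodes, adj)
```

Pruning to top-$k$ edges keeps the aggregation cost linear in the number of retained edges, which is consistent with the low inference overhead the abstract claims, though the actual module may differ in how nodes and edge weights are defined.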
