Domain-Adaptive Full-Face Gaze Estimation via Novel-View-Synthesis and Feature Disentanglement
Jiawei Qin, Takuru Shimoyama, Xucong Zhang, Yusuke Sugano
TL;DR
This work tackles cross-domain appearance-based gaze estimation by synthesizing realistic full-face training data from single-view images through novel-view reconstruction, then bridging the synthetic-to-real gap with a disentangling autoencoder (DisAE) and self-training. The data synthesis relies on 3D face reconstruction and projective matching to generate accurate head pose and gaze labels, and renders the faces under varied lighting and backgrounds to enrich appearance. DisAE learns gaze-related representations by disentangling appearance, head pose, and gaze, and is further adapted to target domains through augmentation-based self-training, including a background-switching consistency loss. Across multiple target domains, the method achieves state-of-the-art cross-dataset and unsupervised domain adaptation performance, and experiments indicate that training on synthetic data can approach the performance of training on real data, highlighting the practical potential of synthetic training for gaze estimation.
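To make the idea of a background-switching consistency loss concrete, the sketch below penalizes disagreement between gaze predictions for the same face rendered over two different backgrounds. It is only an illustration under stated assumptions: predictions are taken to be 3D gaze vectors, the disagreement is measured as a mean angular difference, and the function names `angular_error_deg` and `background_consistency_loss` are hypothetical. The exact formulation used in the paper is not given in this summary and may differ.

```python
import math


def angular_error_deg(a, b):
    """Angle in degrees between two 3D gaze vectors (need not be unit length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to guard against values slightly outside [-1, 1] from rounding.
    cos_sim = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_sim))


def background_consistency_loss(preds_bg_a, preds_bg_b):
    """Mean angular disagreement between paired predictions.

    preds_bg_a[i] and preds_bg_b[i] are assumed to be predictions for the
    same face image composited over two different backgrounds.
    """
    errors = [angular_error_deg(a, b) for a, b in zip(preds_bg_a, preds_bg_b)]
    return sum(errors) / len(errors)
```

The loss is zero when the background has no effect on the prediction, which is the behavior such a consistency term is meant to encourage; no ground-truth gaze label is needed to compute it.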
Abstract
With the recent development of deep neural networks, appearance-based gaze estimation has achieved considerable success when training and testing are performed within the same domain. In the cross-domain setting, however, the variation between domains causes performance to drop severely, which prevents the deployment of gaze estimation in real-world applications. Among all the factors, the ranges of head pose and gaze are believed to play a significant role in the final performance of gaze estimation, while collecting data that covers large ranges is expensive. This work proposes an effective model training pipeline consisting of training data synthesis and a gaze estimation model for unsupervised domain adaptation. The proposed data synthesis leverages single-image 3D reconstruction to expand the range of head poses in the source domain without requiring a 3D facial shape dataset. To bridge the inevitable gap between synthetic and real images, we further propose an unsupervised domain adaptation method suited to synthetic full-face data. We propose a disentangling autoencoder network to separate gaze-related features and introduce a background augmentation consistency loss that exploits the characteristics of the synthetic source domain. Through comprehensive experiments, we show that a model trained only on our synthetic data can perform comparably to one trained on real data extended with a large label range. Our proposed domain adaptation approach further improves performance on multiple target domains. The code and data will be available at https://github.com/ut-vision/AdaptiveGaze.
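The pose-range expansion described above depends on keeping gaze labels consistent when a reconstructed face is re-rendered from a new viewpoint. The following minimal sketch shows one way such relabeling can work: the gaze direction is rotated by the same rotation that is applied to the face. The pitch/yaw-to-vector convention, the restriction to a rotation about the vertical axis, and all function names are assumptions made for illustration; the paper's actual synthesis pipeline (including the 3D reconstruction and projective matching steps) is not reproduced here.

```python
import math


def pitchyaw_to_vector(pitch, yaw):
    """Convert (pitch, yaw) in radians to a unit gaze vector.

    Assumed convention: (0, 0) looks along -z, toward the camera.
    """
    return [
        -math.cos(pitch) * math.sin(yaw),
        -math.sin(pitch),
        -math.cos(pitch) * math.cos(yaw),
    ]


def vector_to_pitchyaw(v):
    """Inverse of pitchyaw_to_vector for any non-zero 3D vector."""
    norm = math.sqrt(sum(c * c for c in v))
    x, y, z = (c / norm for c in v)
    pitch = math.asin(max(-1.0, min(1.0, -y)))
    yaw = math.atan2(-x, -z)
    return pitch, yaw


def rotation_about_y(angle):
    """3x3 rotation matrix for a rotation of `angle` radians about the y axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotate(R, v):
    """Apply a 3x3 rotation matrix (nested lists) to a 3D vector."""
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]


def relabel_gaze(pitch, yaw, R):
    """Gaze label after the face is rotated by R relative to the camera."""
    return vector_to_pitchyaw(rotate(R, pitchyaw_to_vector(pitch, yaw)))
```

Under this convention, rotating about the vertical axis shifts the yaw label by the rotation angle and leaves pitch unchanged, so a single source image can yield training samples across a much wider yaw range than the one originally captured.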
