Domain-Adaptive Full-Face Gaze Estimation via Novel-View-Synthesis and Feature Disentanglement

Jiawei Qin, Takuru Shimoyama, Xucong Zhang, Yusuke Sugano

TL;DR

This work tackles cross-domain appearance-based gaze estimation by synthesizing realistic full-face training data from single-view images using novel-view reconstruction, then bridging the synthetic-to-real gap with a disentangling auto-encoder (DisAE) and self-training. The data synthesis relies on 3D face reconstruction and projective matching to generate accurate head pose and gaze labels, and renders images with varied lighting and backgrounds to enrich appearance. The DisAE learns gaze-related representations by disentangling appearance, head pose, and gaze, and is further adapted to target domains through augmentation-based self-training, including a background-switching consistency loss. Across multiple target domains, the method achieves state-of-the-art cross-dataset and unsupervised domain adaptation performance. Experiments also indicate that training on synthetic data can approach the performance of training on real data, highlighting the practical potential of synthetic training for gaze estimation.

Abstract

Along with the recent development of deep neural networks, appearance-based gaze estimation has achieved considerable success when training and testing within the same domain. Compared to the within-domain task, the variance across domains causes cross-domain performance to drop severely, preventing the deployment of gaze estimation in real-world applications. Among all the factors, the ranges of head pose and gaze are believed to play significant roles in the final performance of gaze estimation, yet collecting data that covers large ranges is expensive. This work proposes an effective model training pipeline consisting of training data synthesis and a gaze estimation model for unsupervised domain adaptation. The proposed data synthesis leverages single-image 3D reconstruction to expand the range of head poses from the source domain without requiring a 3D facial shape dataset. To bridge the inevitable gap between synthetic and real images, we further propose an unsupervised domain adaptation method suitable for synthetic full-face data. We propose a disentangling autoencoder network to separate gaze-related features and introduce a background augmentation consistency loss to utilize the characteristics of the synthetic source domain. Through comprehensive experiments, we show that a model using only our synthetic training data can perform comparably to a model trained on real data extended with a large label range. Our proposed domain adaptation approach further improves the performance on multiple target domains. The code and data will be available at https://github.com/ut-vision/AdaptiveGaze.

Paper Structure (26 sections, 5 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Overview of our approach with two stages. (1) With a monocular source image as input, we synthesize data covering a large range of head poses based on 3D face reconstruction. We propose a feature-disentangling auto-encoder network pre-trained only on the synthetic data from the source images. (2) We leverage self-training to adapt the model to unlabeled target domains.
  • Figure 2: Overview of the data synthesis pipeline. With a monocular image as input, we first obtain the face patch by cropping and scaling. We then fit the 3D face model to the input face patch. We assume that 3D face reconstruction methods generate facial meshes under an orthogonal projection model. Through the proposed projective matching, we convert the mesh from the image-pixel system to the physical camera coordinate system. After this process, the 3D face is aligned with the ground-truth gaze position (in the physical camera coordinate system), so we can rotate the 3D face to simulate different head poses.
  • Figure 3: Determining the location of $\mathcal{V}_{c}$ via parameters $\alpha$ and $\beta$. $\alpha$ indicates a scaling factor from the pixel to the physical (e.g., millimeter) unit, and $\beta$ is the bias term to align $\alpha d$ to the camera coordinate system.
  • Figure 4: Examples of the synthesized images. The first row shows the source images from the MPIIFaceGaze [swcnn_zhang2017s] and ETH-XGaze [Zhang2020ETHXGaze] datasets. For MPIIFaceGaze, the second and third rows show synthesized images in full and weak lighting. For ETH-XGaze, the second row shows the real images from the dataset, and the third row shows our synthetic images with the same head poses as the real samples. For each synthetic example, the three columns show the black, color, and scene backgrounds in turn. The red arrows indicate gaze direction vectors.
  • Figure 5: Overview of our synthetic-real domain adaptation approach. Top: An encoder-decoder structure for feature disentanglement (DisAE). We prepare three subnets $\psi^{\textrm{a}}$, $\psi^{\textrm{h}}$, and $\psi^{\textrm{g}}$ to disentangle appearance, head pose, and gaze features, respectively. The gaze features are fed into a vision transformer to get the predicted gaze direction $\hat{\bm{g}}$, and the head pose features are fed into an MLP to get the predicted head pose direction $\hat{\bm{h}}$. Bottom: Augmentation consistency is applied during the unsupervised domain adaptation of DisAE towards the target domain.
  • ...and 2 more figures
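
The geometry described in the captions of Figures 2 and 3 can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: $d$ is treated as a scalar pixel-unit depth, $\beta$ as a scalar offset, the rotation is taken about the face center, and all function names are hypothetical. It shows only that a pixel-unit quantity is mapped to physical units as $\alpha d + \beta$, and that rotating the aligned face together with its gaze target gives a new head pose with a consistent gaze label.

```python
import math

# Hypothetical helper: the caption only states that alpha scales pixel units
# to physical units and beta aligns alpha * d to the camera coordinate system.
# Treating d and beta as scalars is an assumption made for this sketch.
def pixel_to_camera(d, alpha, beta):
    return alpha * d + beta

def rotation_y(angle_rad):
    """Rotation matrix about the camera y-axis (a yaw change)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def mat_vec(R, v):
    return [sum(R[i][j] * v[j] for j in range(3)) for i in range(3)]

def rotate_about(R, point, center):
    """Rotate a 3D point around a fixed center."""
    rel = [p - c for p, c in zip(point, center)]
    return [r + c for r, c in zip(mat_vec(R, rel), center)]

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def synthesize_view(vertices, face_center, gaze_target, R):
    """Rotate the mesh and the gaze target rigidly about the face center.

    Returns the rotated vertices and the unit gaze direction from the face
    center to the rotated gaze target (assumed labeling convention).
    """
    new_vertices = [rotate_about(R, v, face_center) for v in vertices]
    new_target = rotate_about(R, gaze_target, face_center)
    gaze_dir = unit([t - c for t, c in zip(new_target, face_center)])
    return new_vertices, gaze_dir

# Usage: a face 600 mm in front of the camera looking at the camera origin,
# rotated by 30 degrees about the y-axis.
face_center = [0.0, 0.0, pixel_to_camera(100.0, 0.5, 550.0)]
vertices = [[30.0, 0.0, 580.0], [-30.0, 0.0, 580.0]]
rotated, gaze = synthesize_view(vertices, face_center, [0.0, 0.0, 0.0],
                                rotation_y(math.radians(30.0)))
```

Because the mesh and the gaze target are rotated by the same rigid transform, the gaze label stays consistent with the rendered head pose, which is what allows one source image to provide labeled samples at many poses.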