UniGaze: Towards Universal Gaze Estimation via Large-scale Pre-Training

Jiawei Qin, Xucong Zhang, Yusuke Sugano

TL;DR

UniGaze tackles the persistent problem of cross-domain generalization in appearance-based gaze estimation by introducing a large-scale self-supervised pre-training strategy tailored to facial geometry. Using MAE pre-training on a diverse, normalized face dataset that blends real and synthetic sources, it learns gaze-relevant representations that transfer effectively to downstream gaze tasks. The approach yields substantial improvements across cross-dataset, leave-one-dataset-out, and joint-dataset evaluations, outperforming semantic-pretraining baselines and domain-generalization methods, particularly with ViT backbones. Critical findings include the necessity of input normalization, broad head-pose coverage, and identity diversity, as well as the value of mixed real/synthetic/novel-view data for robust gaze modeling. These results provide practical guidelines for robust gaze estimation in unconstrained, real-world applications and are accompanied by an open-source implementation.

Abstract

Despite decades of research on data collection and model architectures, current gaze estimation models encounter significant challenges in generalizing across diverse data domains. Recent advances in self-supervised pre-training have shown remarkable generalization performance across various vision tasks. However, their effectiveness in gaze estimation remains unexplored. We propose UniGaze, which for the first time leverages large-scale in-the-wild facial datasets for gaze estimation through self-supervised pre-training. Through systematic investigation, we clarify the factors that are essential for effective pre-training in gaze estimation. Our experiments reveal that self-supervised approaches designed for semantic tasks fail when applied to gaze estimation, while our carefully designed pre-training pipeline consistently improves cross-domain performance. Through comprehensive experiments on challenging cross-dataset evaluation and novel protocols, including leave-one-dataset-out and joint-dataset settings, we demonstrate that UniGaze significantly improves generalization across multiple data domains while minimizing reliance on costly labeled data. Source code and models are available at https://github.com/ut-vision/UniGaze.

Paper Structure

This paper contains 39 sections, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Example comparison of pixel normalization during MAE pre-training. The left, middle, and right columns show the original image, the masked input, and the reconstructed image, respectively.
  • Figure 2: Examples of normalized facial images from the different datasets used in the pre-training stage. We also plot their head pose distributions, where the vertical axis is the pitch angle and the horizontal axis is the yaw angle, both in degrees.
  • Figure 2: Effect of MAE pre-training dataset composition on downstream gaze estimation performance. The horizontal axis represents the incremental accumulation of datasets, while the vertical axis shows the percentage reduction in error relative to the first dataset, CelebV-Text [yu2023celebv].
  • Figure 3: Effect of MAE pre-training data size on gaze estimation performance. The horizontal axis is the percentage of the pre-training data used, and the vertical axis is the percentage reduction in error relative to the 0% baseline.
  • Figure 3: Qualitative results from various in-the-wild video examples. The normalized input images are displayed alongside the original image for reference.
  • ...and 2 more figures
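
The pixel normalization that Figure 1 compares can be illustrated with a minimal NumPy sketch of a masked-autoencoder reconstruction loss. This is not the authors' code: it follows the general MAE recipe, in which each target patch is optionally standardized to zero mean and unit variance before the mean squared error is taken over the masked patches only. The function names, the 224-pixel input, the 16-pixel patch size, and the 75% mask ratio are all assumptions chosen for illustration.

```python
import numpy as np


def patchify(img, patch):
    """Split an (H, W, C) image into rows of shape (num_patches, patch*patch*C)."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def random_mask(num_patches, mask_ratio, rng):
    """Boolean vector that is True where a patch is hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask


def mae_loss(pred, target, mask, norm_pix=True, eps=1e-6):
    """Mean squared error averaged over masked patches only.

    With norm_pix=True, each target patch is standardized independently,
    so the decoder regresses local contrast rather than absolute intensity.
    """
    if norm_pix:
        mean = target.mean(axis=1, keepdims=True)
        var = target.var(axis=1, keepdims=True)  # population variance (ddof=0)
        target = (target - mean) / np.sqrt(var + eps)
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float((per_patch * mask).sum() / mask.sum())


# Illustrative values only (assumed, not taken from the paper).
rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))        # stand-in for a normalized face crop
patches = patchify(img, 16)            # 14 x 14 = 196 patches of 16*16*3 values
mask = random_mask(len(patches), 0.75, rng)

# An all-zero prediction against standardized targets gives a loss close to 1,
# because each standardized patch has (almost exactly) unit variance.
loss_norm = mae_loss(np.zeros_like(patches), patches, mask, norm_pix=True)

# A perfect prediction against raw pixel targets gives zero loss.
loss_raw = mae_loss(patches, patches, mask, norm_pix=False)
```

Standardizing per patch removes the patch's own brightness and contrast from the target, which is why reconstructions trained with and without it can look visibly different, as the figure's three-column comparison suggests.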