Real-time Appearance-based Gaze Estimation for Open Domains

Zhenhao Li, Zheng Liu, Seunghyun Lee, Amin Fadaeinejad, Yuanhao Yu

Abstract

Appearance-based gaze estimation (AGE) has achieved remarkable performance in constrained settings, yet we reveal a significant generalization gap where existing AGE models often fail in practical, unconstrained scenarios, particularly those involving facial wearables and poor lighting conditions. We attribute this failure to two core factors: limited image diversity and inconsistent label fidelity across different datasets, especially along the pitch axis. To address these, we propose a robust AGE framework that enhances generalization without requiring additional human-annotated data. First, we expand the image manifold via an ensemble of augmentation techniques, including synthesis of eyeglasses, masks, and varied lighting. Second, to mitigate the impact of anisotropic inter-dataset label deviation, we reformulate gaze regression as a multi-task learning problem, incorporating multi-view supervised contrastive (SupCon) learning, discretized label classification, and eye-region segmentation as auxiliary objectives. To rigorously validate our approach, we curate new benchmark datasets designed to evaluate gaze robustness under challenging conditions, a dimension largely overlooked by existing evaluation protocols. Our MobileNet-based lightweight model achieves generalization performance competitive with the state-of-the-art (SOTA) UniGaze-H, while utilizing less than 1% of its parameters, enabling high-fidelity, real-time gaze tracking on mobile devices.

Paper Structure

This paper contains 65 sections, 2 equations, 23 figures, 9 tables, and 2 algorithms.

Figures (23)

  • Figure 1: Comparison of generalization performance between our MobileNet-based model and the SOTA UniGaze [unigaze]. (a) On our benchmark datasets RealGaze (top) and ZeroGaze (bottom), UniGaze-H (red arrows) exhibits significantly higher prediction variance under occlusion. (b) Our lightweight model maintains superior robustness and manifold consistency compared to significantly larger baselines, effectively minimizing the impact of visual alterations. Detailed analysis is provided in the RealGaze experiments section.
  • Figure 2: Qualitative comparison of samples from different datasets with near-identical annotations; the visually divergent gaze directions, particularly along the pitch axis, illustrate the inherent unreliability of cross-dataset vertical ground-truth labels.
  • Figure 3: Framework of the proposed multi-task learning architecture. Red blocks indicate the streamlined architecture during inference.
  • Figure 4: Overview of the automated data augmentation pipeline. During training, we stochastically combine these methods for each sample to expand the training manifold (a toy composition sketch follows this figure list).
  • Figure 5: Pipeline for pose-consistent eyeglasses template generation. (a) Original face images, with 30 discrete head poses. (b) GlassesGAN outputs, featuring diverse frame styles. (c) Extracted glasses templates and examples of augmented training samples.
  • ...and 18 more figures