Appearance Debiased Gaze Estimation via Stochastic Subject-Wise Adversarial Learning

Suneung Kim, Woo-Jeoung Nam, Seong-Whan Lee

TL;DR

This work tackles appearance bias in appearance-based gaze estimation by introducing SAZE, which combines a Face generalization Network (Fgen-Net), trained with an adversarial loss to induce appearance-invariant gaze features, with a stochastic subject-wise meta-learning strategy that mitigates overfitting to a limited set of subjects. The method achieves state-of-the-art mean angular errors on MPIIGaze ($3.89^{\circ}$) and EyeDiap ($4.42^{\circ}$) and demonstrates improved generalization both within and across datasets, including evaluations with GAN-generated style variations. Key contributions are the adversarial loss, which updates the face-to-gaze encoder so that the identity classifier predicts a uniform distribution over subjects, and a stochastic subject-wise optimization inspired by Reptile that diversifies the subject appearances seen during training. The approach is also practical: it uses only face images (no dual-eye inputs) and reduces computational complexity, while maintaining robust generalization across diverse environments and unseen domains.
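For readers unfamiliar with the metric, the mean angular error quoted above is conventionally the angle between the predicted and ground-truth 3D gaze vectors, averaged over samples. The following is a minimal sketch of that computation, assuming gaze is given as 3D direction vectors; the function names are illustrative and do not come from the paper.

```python
import numpy as np

def angular_error_deg(pred, target):
    """Angle in degrees between predicted and ground-truth 3D gaze vectors."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # Normalize so that the dot product equals the cosine of the angle.
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    target = target / np.linalg.norm(target, axis=-1, keepdims=True)
    # Clip to guard against rounding slightly outside [-1, 1].
    cos = np.clip(np.sum(pred * target, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def mean_angular_error_deg(preds, targets):
    """Average angular error over a batch of gaze vectors."""
    return float(np.mean(angular_error_deg(preds, targets)))
```

Under this definition, identical vectors give an error of 0 degrees and orthogonal vectors give 90 degrees.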

Abstract

Recently, appearance-based gaze estimation has been attracting attention in computer vision, and remarkable improvements have been achieved using various deep learning techniques. Despite such progress, most methods aim to infer gaze vectors from images directly, which causes overfitting to person-specific appearance factors. In this paper, we address these challenges and propose a novel framework: Stochastic subject-wise Adversarial gaZE learning (SAZE), which trains a network to generalize over the appearance of subjects. We design a Face generalization Network (Fgen-Net) using a face-to-gaze encoder, a face identity classifier, and a proposed adversarial loss. The proposed loss generalizes face appearance factors so that the identity classifier infers a uniform probability distribution. In addition, the Fgen-Net is trained by a learning mechanism that optimizes the network by reselecting a subset of subjects at every training step to avoid overfitting. Our experimental results verify the robustness of the method in that it yields state-of-the-art performance, achieving mean angular errors of $3.89^{\circ}$ and $4.42^{\circ}$ on the MPIIGaze and EyeDiap datasets, respectively. Furthermore, we demonstrate the positive generalization effect by conducting further experiments using face images with different styles generated by a generative model.
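The idea of pushing the identity classifier toward a uniform prediction can be illustrated with a cross-entropy against a uniform target. This is only a sketch under stated assumptions: the paper's exact adversarial loss is not reproduced in this section, so the form below (uniform-target cross-entropy) and the function names are assumptions chosen for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - np.max(logits, axis=-1, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=-1, keepdims=True))

def identity_loss(logits, subject_ids):
    """Standard cross-entropy: trains the classifier to recognize subjects."""
    logp = log_softmax(logits)
    return float(-np.mean(logp[np.arange(len(subject_ids)), subject_ids]))

def uniform_adversarial_loss(logits):
    """Cross-entropy against the uniform target 1/K over K subjects.

    Assumed form, not taken from the paper. It reaches its minimum,
    log(K), exactly when the predicted distribution is uniform.
    """
    logp = log_softmax(logits)
    return float(-np.mean(np.mean(logp, axis=-1)))
```

In such a setup the identity loss would update the classifier, while the uniform-target loss would update the encoder, so that the features it produces carry as little subject identity as possible.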

Paper Structure (22 sections, 11 equations, 9 figures, 4 tables, 1 algorithm)

Figures (9)

  • Figure 1: Overview of the SAZE framework. Given training images with ground-truth gaze directions, we first select subjects from the training data and then construct meta-training and meta-adapting sets from them. The Face generalization Network (Fgen-Net) is then trained on these two subsets using the proposed adversarial loss and meta-learning.
  • Figure 2: Illustration of Face generalization Network (Fgen-Net) learning based on the adversarial loss. (a) Fgen-Net is optimized through two forward paths. In the first path, the face identity classifier $c^{id}$ is updated through the identity loss $L_{idc}$ (Eq. (1)), which trains it to distinguish face appearances. In the second path, the face-to-gaze encoder $E^{g}$ is updated through the adversarial loss $L_{adv}$ (Eq. (3)) and the gaze loss $L_{g}$ (Eq. (4)). The adversarial loss induces the identity classifier to predict a uniform probability over face appearances, which improves generalization. (b) Detailed architecture of the face-to-gaze encoder $E^{g}$; $*$ denotes element-wise multiplication. (c) Detailed architecture of the face identity classifier $c^{id}$.
  • Figure 3: Image examples generated by a generative model [shen2020interpreting]. The first column shows the results of adjusting the latent space along the gender attribute, the second column along age, and the third column along pose.
  • Figure 4: Participant-wise gaze accuracy.
  • Figure 5: Comparison of t-SNE visualizations using point cloud classification results for SWCNN and SAZE. The red points represent untrained subjects.
  • ...and 4 more figures
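
The subject reselection described in the Figure 1 caption can be sketched as a Reptile-style loop on a toy problem. Everything below is an assumption made for illustration: a linear regressor stands in for Fgen-Net, the synthetic "subjects" differ only by a small offset, and the split sizes, learning rates, and interpolation step are hypothetical. The paper's own algorithm is not reproduced in this section.

```python
import numpy as np

def mse_grad(w, x, y):
    """Gradient of the mean squared error for a linear model y ~ x @ w."""
    return 2.0 * x.T @ (x @ w - y) / len(y)

def stochastic_subject_step(w, data, rng, n_select=4, n_train=2,
                            inner_lr=0.05, outer_lr=0.5):
    """One outer step: reselect subjects, adapt on two subsets, interpolate."""
    # Draw a fresh subset of subjects at every training step.
    subjects = rng.choice(len(data), size=n_select, replace=False)
    meta_train, meta_adapt = subjects[:n_train], subjects[n_train:]
    w_fast = w.copy()
    for group in (meta_train, meta_adapt):
        for s in group:
            x, y = data[s]
            w_fast -= inner_lr * mse_grad(w_fast, x, y)
    # Reptile-style update: move the slow weights toward the adapted ones.
    return w + outer_lr * (w_fast - w)

def make_toy_subjects(rng, n_subjects=8, n_samples=32):
    """Synthetic subjects sharing one mapping plus a per-subject offset."""
    w_true = np.array([1.0, -2.0, 0.5])
    data = []
    for _ in range(n_subjects):
        x = rng.normal(size=(n_samples, 3))
        offset = rng.normal(scale=0.1)  # stands in for appearance bias
        data.append((x, x @ w_true + offset))
    return data

def mean_loss(w, data):
    """Mean squared error averaged over all subjects."""
    return float(np.mean([np.mean((x @ w - y) ** 2) for x, y in data]))

rng = np.random.default_rng(0)
data = make_toy_subjects(rng)
w = np.zeros(3)
loss_before = mean_loss(w, data)
for _ in range(200):
    w = stochastic_subject_step(w, data, rng)
loss_after = mean_loss(w, data)
```

Because a different subset of subjects is drawn at each step, no single subject's offset dominates the update, which is the intuition behind using reselection to limit overfitting to person-specific appearance.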