SLYKLatent: A Learning Framework for Gaze Estimation Using Deep Facial Feature Learning

Samuel Adebayo; Joost C. Dessing; Seán McLoone

TL;DR

SLYKLatent tackles appearance uncertainty and domain generalization in gaze estimation by marrying self-supervised learning with a patch-focused downstream refinement. The framework uses a modified BYOL backbone (mBYOL) to learn global face and local eye representations, augmented with a Patch Module Network to fuse eye-patch features during fine-tuning, and an inverse explained variance loss to weight difficult predictions. Ablation studies corroborate the critical roles of the local eye branch, PMN, and inv-EV, with strong gains on MPIIFaceGaze, Gaze360, and ETH-XGaze, as well as robust performance under appearance variations and rotations. The approach also demonstrates transferable benefits to facial expression recognition, highlighting the method’s versatility for vision tasks beyond gaze estimation, though it relies on reliable eye-patch detection and shows room for further probabilistic uncertainty modeling.

Abstract

In this research, we present SLYKLatent, a novel approach for enhancing gaze estimation by addressing appearance instability challenges in datasets due to aleatoric uncertainties, covariate shifts, and test domain generalization. SLYKLatent uses self-supervised learning for initial training with facial expression datasets, followed by refinement with a patch-based tri-branch network and an inverse explained variance-weighted training loss function. On benchmark datasets, SLYKLatent achieves a 10.9% improvement on Gaze360, surpasses the top MPIIFaceGaze results by 3.8%, and leads on a subset of ETH-XGaze by 11.6%. Adaptability tests on RAF-DB and AffectNet show accuracies of 86.4% and 60.9%, respectively. Ablation studies confirm the effectiveness of SLYKLatent's novel components.
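
The abstract does not spell out the inverse explained variance-weighted loss. As background only, the standard explained variance statistic that the name refers to can be sketched as below. The function name `explained_variance` is illustrative, and how the paper inverts this quantity and applies it as a training weight is not shown here and should not be inferred from this sketch.

```python
from statistics import pvariance


def explained_variance(y_true, y_pred):
    """Standard explained variance: 1 - Var(y_true - y_pred) / Var(y_true).

    Illustrative helper only; this is not the paper's loss function.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    var_true = pvariance(y_true)
    if var_true == 0:
        raise ValueError("explained variance is undefined for constant targets")
    # Residuals between targets and predictions.
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1.0 - pvariance(residuals) / var_true


if __name__ == "__main__":
    targets = [1.0, 2.0, 3.0, 4.0]
    print(explained_variance(targets, targets))    # perfect predictions
    print(explained_variance(targets, [2.5] * 4))  # predicting the mean
```

A value near 1 means the predictions account for most of the variance in the targets, while a value near 0 means they do no better than predicting the mean, so a loss weighted by the inverse of this quantity would plausibly emphasize poorly explained predictions.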

Paper Structure (26 sections, 19 equations, 6 figures, 4 tables)

Introduction
Background and Motivation
Purpose and Contribution
Background
Gaze Estimation
Learning-based Appearance Gaze Estimation
Gaze Estimation using Facial Patches
Self-supervised Learning for Improving Gaze Estimation
The SLYKLatent Framework
mBYOL
Augmentation View
Modification of the Representation layer
Architecture details of mBYOL
Downstream Transfer Learning Fine-tuning
SLYKLatent Downstream Gaze Estimation
...and 11 more sections

Figures (6)

Figure 1: High-Level view of SLYKLatent. The framework consists of two network modules; a self-supervised pretraining module and a patch module network.
Figure 2: The modified Bootstrap Your Own Latent (mBYOL) architecture. The architecture consists of two parallel asymmetric networks, the target network $\tau'$ and the online network $\tau$. Each network has three stages: the augmented view $v$; the representation stage $y$, where image embeddings are computed; and the projection layer $z$. The online network additionally includes a prediction stage. The mBYOL loss computes the negative cosine similarity between the outputs of the online and target networks.
Figure 3: The Augmentation view applied on mBYOL.
Figure 4: Schematic diagram of the proposed framework, SLYKLatent. SLYKLatent comprises mBYOL, a modification of the BYOL framework, and a downstream fine-tuning stage built around an eye-patch module network. The resulting features are concatenated and regressed against the ground-truth gaze vector.
Figure 5: Comparison of gaze estimation under appearance uncertainties with L2CSNet. The top row shows predictions from the L2CSNet model, while the bottom row shows predictions from our SLYKLatent model. Various appearance uncertainties, such as low illumination and image blur, are demonstrated. Red arrows indicate ground-truth gaze directions, yellow arrows represent predictions from the L2CSNet model, and blue arrows represent predictions from the SLYKLatent model.
...and 1 more figure
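
The Figure 2 caption describes the mBYOL loss as a negative cosine similarity between the online and target networks. A minimal sketch of that quantity for a single pair of vectors follows. The names `negative_cosine_similarity`, `online_prediction`, and `target_projection` are illustrative assumptions, and the sketch omits whatever batching, symmetrization, or gradient handling the actual mBYOL implementation uses.

```python
import math


def negative_cosine_similarity(online_prediction, target_projection):
    """Return -cos(p, z) for two equal-length vectors.

    The value ranges from -1 (same direction) to +1 (opposite direction),
    so minimizing it pulls the two vectors into alignment.
    """
    if len(online_prediction) != len(target_projection):
        raise ValueError("vectors must have the same length")
    dot = sum(p * z for p, z in zip(online_prediction, target_projection))
    norm_p = math.sqrt(sum(p * p for p in online_prediction))
    norm_z = math.sqrt(sum(z * z for z in target_projection))
    if norm_p == 0.0 or norm_z == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return -dot / (norm_p * norm_z)


if __name__ == "__main__":
    print(negative_cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # aligned
    print(negative_cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # orthogonal
    print(negative_cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposed
```

Because the vectors are normalized inside the function, the loss depends only on the direction of the embeddings and not on their magnitude.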
