SLYKLatent: A Learning Framework for Gaze Estimation Using Deep Facial Feature Learning
Samuel Adebayo, Joost C. Dessing, Seán McLoone
TL;DR
SLYKLatent tackles appearance uncertainty and domain generalization in gaze estimation by combining self-supervised learning with a patch-focused downstream refinement. The framework uses a modified BYOL backbone (mBYOL) to learn global face and local eye representations, adds a Patch Module Network (PMN) that fuses eye-patch features during fine-tuning, and trains with an inverse explained variance (inv-EV) loss that weights difficult predictions. Ablation studies corroborate the critical roles of the local eye branch, PMN, and inv-EV, with strong gains on MPIIFaceGaze, Gaze360, and ETH-XGaze, as well as robust performance under appearance variations and rotations. The approach also transfers to facial expression recognition, which suggests it is useful for vision tasks beyond gaze estimation, though it relies on reliable eye-patch detection and leaves room for further probabilistic uncertainty modeling.
Abstract
In this research, we present SLYKLatent, a novel approach for enhancing gaze estimation by addressing appearance instability in datasets caused by aleatoric uncertainty, covariate shift, and test-domain generalization. SLYKLatent uses self-supervised learning for initial training on facial expression datasets, followed by refinement with a patch-based tri-branch network and an inverse explained variance-weighted training loss function. On benchmark datasets, SLYKLatent achieves a 10.9% improvement on Gaze360, surpasses the top MPIIFaceGaze results by 3.8%, and leads on a subset of ETH-XGaze by 11.6%, outperforming existing methods in each case. Adaptability tests on RAF-DB and AffectNet show accuracies of 86.4% and 60.9%, respectively. Ablation studies confirm the effectiveness of SLYKLatent's novel components.
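
The idea of weighting a loss by inverse explained variance can be illustrated with a minimal sketch. The exact inv-EV formulation is not given in this summary, so everything below is an assumption made for illustration: explained variance is taken as EV = 1 - Var(y - ŷ) / Var(y), the base loss is a mean absolute error, the weight is 1 / EV with EV clamped to a small positive floor, and the function names `explained_variance` and `inv_ev_weighted_l1` are hypothetical. Labels are treated as scalars for simplicity, whereas gaze labels are usually pitch and yaw angles.

```python
from statistics import pvariance


def explained_variance(y_true, y_pred):
    """Explained variance: 1 - Var(residuals) / Var(targets)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    var_true = pvariance(y_true)
    if var_true == 0.0:
        # Degenerate batch with constant targets: treat as nothing explained.
        return 0.0
    return 1.0 - pvariance(residuals) / var_true


def inv_ev_weighted_l1(y_true, y_pred, eps=1e-6):
    """Mean absolute error scaled by 1 / EV (assumed weighting, not the paper's)."""
    ev = explained_variance(y_true, y_pred)
    # Clamp EV to [eps, 1] so the weight stays positive and finite.
    ev = min(max(ev, eps), 1.0)
    weight = 1.0 / ev
    l1 = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    return weight * l1


if __name__ == "__main__":
    targets = [0.0, 1.0, 2.0, 3.0]
    preds = [0.5, 0.5, 2.5, 2.5]
    print(explained_variance(targets, preds))
    print(inv_ev_weighted_l1(targets, preds))
```

Under these assumptions, a batch whose predictions explain little of the target variance receives a larger weight, so poorly fit batches contribute more to the training signal. In a real training loop the same computation would be written with differentiable tensor operations.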
