Appearance-based Gaze Estimation With Deep Learning: A Review and Benchmark
Yihua Cheng, Haofei Wang, Yiwei Bao, Feng Lu
TL;DR
This review consolidates deep learning methods for appearance-based gaze estimation, which map eye or face appearance to gaze. It organizes the literature along four aspects (deep feature extraction, network architectures, personal calibration, and devices/platforms) and builds a comprehensive benchmark across public datasets with documented pre-processing and post-processing steps. Key contributions include a taxonomy of input types (eye, face, video), a survey of supervised and unsupervised learning strategies and of domain-adaptation approaches, and a benchmark that enables fair cross-method comparison. The review is of practical use to researchers working on robust, cross-subject gaze estimation, and it characterizes the public datasets and releases source code for typical algorithms, which supports real-world deployment of gaze-based interfaces. It also highlights challenges such as subject variability, head motion, and illumination, and recommends future directions in robust feature extraction, rapid calibration, and interpretable gaze representations.
Abstract
Human gaze provides valuable information on human focus and intentions, making it a crucial area of research. Recently, deep learning has revolutionized appearance-based gaze estimation. However, gaze estimation research has some unique characteristics, such as the unfair comparison between 2D gaze positions and 3D gaze vectors and the use of different pre-processing and post-processing methods, so there is no definitive guideline for developing deep learning-based gaze estimation algorithms. In this paper, we present a systematic review of appearance-based gaze estimation methods using deep learning. Firstly, we survey the existing gaze estimation algorithms along the typical gaze estimation pipeline: deep feature extraction, deep learning model design, personal calibration and platforms. Secondly, to fairly compare the performance of different approaches, we summarize the data pre-processing and post-processing methods, including face/eye detection, data rectification, 2D/3D gaze conversion and gaze origin conversion. Finally, we set up a comprehensive benchmark for deep learning-based gaze estimation. We characterize all the public datasets and provide the source code of typical gaze estimation algorithms. This paper serves not only as a reference for developing deep learning-based gaze estimation methods, but also as a guideline for future gaze estimation research. The project web page can be found at https://phi-ai.buaa.edu.cn/Gazehub.
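
To illustrate why 2D gaze positions and 3D gaze vectors cannot be compared directly, the following is a minimal sketch of a 2D/3D gaze conversion, not the benchmark's actual implementation. It assumes the screen pose relative to the camera is known as a rotation matrix and a translation vector (screen-to-camera), that the on-screen point lies on the plane z = 0 of the screen coordinate system, and that the gaze origin is given in camera coordinates. The function names, argument names, and units are hypothetical.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gaze_2d_to_3d(point_2d, origin, rotation, translation):
    """Turn an on-screen gaze point into a unit 3D gaze direction.

    Assumptions (not from the paper): `rotation` (3x3 nested lists) and
    `translation` map screen coordinates to camera coordinates, and the
    screen is the plane z = 0 in its own coordinate system.
    """
    p_screen = (point_2d[0], point_2d[1], 0.0)
    # Gaze target in camera coordinates: t = R p + T.
    target = [_dot(rotation[i], p_screen) + translation[i] for i in range(3)]
    direction = [target[i] - origin[i] for i in range(3)]
    norm = math.sqrt(_dot(direction, direction))
    if norm == 0.0:
        raise ValueError("gaze origin coincides with gaze target")
    return [d / norm for d in direction]


def gaze_3d_to_2d(direction, origin, rotation, translation):
    """Intersect the gaze ray with the screen plane and return (x, y)."""
    # The screen normal in camera coordinates is the third column of R.
    normal = [rotation[i][2] for i in range(3)]
    denom = _dot(normal, direction)
    if abs(denom) < 1e-12:
        raise ValueError("gaze ray is parallel to the screen plane")
    offset = [translation[i] - origin[i] for i in range(3)]
    scale = _dot(normal, offset) / denom
    hit = [origin[i] + scale * direction[i] for i in range(3)]
    # Back to screen coordinates: p = R^T (hit - T); keep x and y only.
    rel = [hit[i] - translation[i] for i in range(3)]
    return [sum(rotation[j][i] * rel[j] for j in range(3)) for i in range(2)]


if __name__ == "__main__":
    # Hypothetical setup: screen rotated 180 degrees about the x-axis,
    # eye 500 units in front of the camera.
    R = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
    T = [10.0, 20.0, 5.0]
    eye = [0.0, 0.0, 500.0]
    g = gaze_2d_to_3d((30.0, 40.0), eye, R, T)
    print(g, gaze_3d_to_2d(g, eye, R, T))
```

Under these assumptions the two functions are inverses of each other for a fixed gaze origin and screen pose, so a method that predicts 2D positions can be scored against one that predicts 3D vectors only after both are converted to the same representation.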
