Appearance-based Gaze Estimation With Deep Learning: A Review and Benchmark

Yihua Cheng, Haofei Wang, Yiwei Bao, Feng Lu

TL;DR

This review addresses the problem of appearance-based gaze estimation by consolidating methods that map eye/face appearance to gaze using deep learning. It articulates a four-fold framework—deep feature extraction, network architectures, personal calibration, and device/platform considerations—and provides a comprehensive benchmark across public datasets with documented preprocessing and post-processing steps. Key contributions include a clear taxonomy of input types (eye, face, video), a survey of supervised and unsupervised learning strategies, and domain-adaptation approaches, complemented by a dataset-driven benchmark that enables fair cross-method comparisons. The work holds practical significance by guiding researchers toward robust, cross-subject gaze estimation and providing resources through public datasets and source code, thereby advancing real-world deployment of gaze-based interfaces. It also highlights challenges such as subject variability, head motion, and illumination, and recommends future directions in robust feature extraction, rapid calibration, and interpretable gaze representations.

Abstract

Human gaze provides valuable information on human focus and intentions, making it a crucial area of research. Recently, deep learning has revolutionized appearance-based gaze estimation. However, due to the unique features of gaze estimation research, such as the unfair comparison between 2D gaze positions and 3D gaze vectors and the different pre-processing and post-processing methods, there is a lack of a definitive guideline for developing deep learning-based gaze estimation algorithms. In this paper, we present a systematic review of the appearance-based gaze estimation methods using deep learning. Firstly, we survey the existing gaze estimation algorithms along the typical gaze estimation pipeline: deep feature extraction, deep learning model design, personal calibration and platforms. Secondly, to fairly compare the performance of different approaches, we summarize the data pre-processing and post-processing methods, including face/eye detection, data rectification, 2D/3D gaze conversion and gaze origin conversion. Finally, we set up a comprehensive benchmark for deep learning-based gaze estimation. We characterize all the public datasets and provide the source code of typical gaze estimation algorithms. This paper serves not only as a reference to develop deep learning-based gaze estimation methods, but also a guideline for future gaze estimation research. The project web page can be found at https://phi-ai.buaa.edu.cn/Gazehub.
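
The 2D/3D gaze conversion mentioned in the abstract can be illustrated with a minimal sketch. It assumes that the screen pose relative to the camera (a rotation matrix and a translation vector) and the 3D gaze origin are already known from calibration, and that the 2D point of gaze is given in millimetres on the screen plane. The function names and the coordinate convention are illustrative assumptions, not the paper's code.

```python
import math


def pog_to_gaze_direction(pog_mm, origin, rotation, translation):
    """Turn a 2D point of gaze (PoG) on the screen into a unit 3D gaze vector.

    pog_mm:      (x, y) on the screen plane, in millimetres (assumed units).
    origin:      3D gaze origin in camera coordinates, e.g. the face centre.
    rotation:    3x3 screen-to-camera rotation (assumed known from calibration).
    translation: screen-to-camera translation, 3 values in millimetres.
    """
    # The screen plane is z = 0 in the screen coordinate system.
    p = (pog_mm[0], pog_mm[1], 0.0)
    # Map the on-screen target into camera coordinates: t = R p + T.
    target = [
        sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    # The gaze direction points from the origin to the target.
    d = [target[i] - origin[i] for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d))
    return [c / norm for c in d]


def angular_error_deg(g1, g2):
    """Angle in degrees between two 3D gaze vectors."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_angle))


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    origin = (0.0, 0.0, 500.0)  # hypothetical: 500 mm in front of the screen
    straight = pog_to_gaze_direction((0.0, 0.0), origin, identity, (0.0, 0.0, 0.0))
    oblique = pog_to_gaze_direction((500.0, 0.0), origin, identity, (0.0, 0.0, 0.0))
    print(angular_error_deg(straight, oblique))
```

Once both 2D and 3D outputs are expressed as 3D vectors in this way, methods can be compared with a single angular error metric. In the toy setup above, the two gaze directions differ by 45 degrees.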

Paper Structure

This paper contains 34 sections, 6 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Deep learning-based gaze estimation relies on simple devices but complex algorithms to estimate human gaze. It usually uses off-the-shelf cameras to capture facial appearance, and employs deep learning algorithms to regress gaze from the appearance. According to this pipeline, we survey current deep learning-based gaze estimation methods from four perspectives: deep feature extraction, deep learning model design, personal calibration, and platforms.
  • Figure 2: From intrusive skin electrodes Young_1975_survey to off-the-shelf web cameras Zhang_2015_CVPR, gaze estimation has become more flexible, and gaze estimation methods have evolved along with the devices. We illustrate five kinds of gaze estimation methods. (1) Attached sensor-based methods. These methods sample the electrical signal of skin electrodes; the signal indicates the eye movement of subjects Eggert_2007_No. (2) 3D eye model recovery methods. These methods usually build a geometric eye model to calculate the visual axis, i.e., the gaze direction. The eye model is fitted based on light reflections. (3) 2D eye feature regression methods. These methods rely on IR cameras to detect geometric eye features, such as the pupil center and glints, and directly regress the point of gaze (PoG) from these features. (4) Conventional appearance-based methods. These methods use entire images as features and directly regress human gaze from them. Some feature reduction methods are also used to extract low-dimensional features. For example, Lu et al. divide eye images into 15 subregions and sum the pixel intensities in each subregion as the feature Lu_2014_TPAMI. (5) Appearance-based gaze estimation with deep learning, which is a recent research hotspot. Face or eye images are fed directly into a designed neural network to learn a latent feature representation, and human gaze is regressed from that representation.
  • Figure 3: The structure of Section 3. We introduce gaze estimation with deep learning from four perspectives.
  • Figure 4: Some typical CNN-based gaze estimation networks. (a) Gaze estimation with eye images Zhang_2017_tpami. (b) Gaze estimation with face images Zhang_2017_CVPRW. (c) Gaze estimation with face and eye images Krafka_2016_CVPR.
  • Figure 5: Gaze estimation with videos. The network first extracts static features from each frame using a typical CNN, and then feeds these static features into an RNN to extract temporal information.
  • ...and 10 more figures
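
The subregion feature described in the Figure 2 caption (conventional appearance-based methods) can be sketched as follows. The caption only states that the eye image is divided into 15 subregions whose pixel intensities are summed; the 3 x 5 grid layout, the function name, and the list-of-lists image format used here are assumptions made for illustration.

```python
def subregion_intensity_feature(image, rows=3, cols=5):
    """Sum pixel intensities over a rows x cols grid of subregions.

    image: grayscale eye image as a list of equal-length rows of numbers.
    rows, cols: grid layout; 3 x 5 = 15 subregions is an assumed layout.
    Returns a flat list of rows * cols sums, in row-major order.
    """
    height = len(image)
    width = len(image[0])
    feature = []
    for r in range(rows):
        # Integer boundaries so that the subregions tile the image exactly.
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            total = sum(
                image[y][x] for y in range(y0, y1) for x in range(x0, x1)
            )
            feature.append(total)
    return feature


if __name__ == "__main__":
    # Hypothetical 6 x 10 image with constant intensity 1.
    toy_image = [[1] * 10 for _ in range(6)]
    print(subregion_intensity_feature(toy_image))
```

The resulting 15-dimensional vector is far smaller than the raw image, which is the point of such feature reduction: a simple regressor can then map it to gaze.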