LEyes: A Lightweight Framework for Deep Learning-Based Eye Tracking using Synthetic Eye Images

Sean Anthony Byrne, Virmarie Maquiling, Marcus Nyström, Enkelejda Kasneci, Diederick C. Niehorster

TL;DR

This paper introduces “Light Eyes” (LEyes), a framework that departs from traditional photorealistic methods by using simple synthetic image generators to train neural networks to detect key image features such as pupils and corneal reflections.

Abstract

Deep learning has bolstered gaze estimation techniques, but real-world deployment has been impeded by inadequate training datasets. This problem is exacerbated by both hardware-induced variations in eye images and inherent biological differences across the recorded participants, leading to both feature- and pixel-level variance that hinders the generalizability of models trained on specific datasets. While synthetic datasets can be a solution, their creation is both time- and resource-intensive. To address this problem, we present a framework called Light Eyes or "LEyes" which, unlike conventional photorealistic methods, models only the key image features required for video-based eye tracking, using simple light distributions. LEyes facilitates easy configuration for training neural networks across diverse gaze-estimation tasks. We demonstrate that models trained using LEyes consistently match or outperform other state-of-the-art algorithms in terms of pupil and corneal reflection (CR) localization across well-known datasets. In addition, a LEyes-trained model outperforms the industry-standard eye tracker while using significantly more cost-effective hardware. Going forward, we are confident that LEyes will revolutionize synthetic data generation for gaze estimation models and lead to significant improvements in the next generation of video-based eye trackers.

Paper Structure (28 sections, 5 equations, 8 figures)

Figures (8)

  • Figure 1: A. Images from the four datasets we used to test the LEyes framework. B. The LEyes synthetic training sets corresponding to the real eye datasets in A. These images are based on the light distributions of the real eye datasets. C. Predictions of the LEyes-trained model on the real eye images. D. An overview of our approach: First, we establish a set of parameters based on the distributions of the collected data. These distributions pertain to pixel-level details such as the iris and pupil intensity. Next, we employ a generator to efficiently produce new synthetic images from these parameters. The generated images are used to train a neural network, which is then tested on real eye images recorded from the same device.
  • Figure 2: A. Cumulative detection rate at different pixel-error thresholds on the OpenEDS 2019 dataset for a U-Net model trained using the LEyes method, compared against PuRe [santini2018pure], Pistol [fuhl2023pistol], DeepVOG [yiu2019deepvog], and ELG [park2018learning]. B. The same comparison against several models trained using the EllSeg framework [kothari2020gazeEllseg_gen]. C & D: The corresponding violin plots for panels A and B, respectively, showing the detection rate at 2 pixel error for each participant in the testing set achieved by LEyes compared with the aforementioned models.
  • Figure 3: Flowchart of the simultaneous P-CR pipeline. Using an adaptive cropping strategy, the center of the crop is set to PuRe's pupil center prediction ($[X_{PuRe}, Y_{PuRe}]$) if the confidence metric for PuRe's prediction ($C$) is above a given confidence threshold ($C_{th}$); otherwise, the crop is determined by the pupil prediction of the LEyes-trained model given a naive center crop ($[X_{img\_center}, Y_{img\_center}]$). The pupil-centered crop is passed through the model, which outputs logits representing likely feature locations for each prediction, illustrated here as heat maps ($M$) for both the pupil ($M_{Pupil}$) and for each CR ($M_{CR1...5}$ in this example). For each CR map, the highest value is located. These peaks are compared between maps, and the two highest values across all the maps determine which CRs are selected. The asterisks mark the maps that contain the two highest values in this example. If the exclusion criteria are met, however, the image is deemed invalid (see text).
  • Figure 4: Heat maps for both the Chugh et al. 2021 dataset and the OpenEDS 2020 dataset. The maximum of the corresponding logit value is shown under each heat map. In the Chugh et al. 2021 dataset, the labeling of the CRs starts at the top-most IR reflection and then proceeds clockwise (top right). In the OpenEDS 2020 dataset, the labels used when training the model start at the lower right CR and proceed clockwise. Our algorithm selects the two highest logit values from the CR maps along with the pupil value for a complete, robust P-CR pipeline. The last column shows the predicted locations of the centers of the pupil and selected CRs on the corresponding eye image.
  • Figure 5: Experimental setup: In a co-recorded setup, we acquired eye images from the FLEX setup and gaze signals from the EyeLink 1000 Plus. Using a dual-CNN approach, we analyzed the eye images recorded from expert participants. The pupil CNN localized the pupil center, while the CR CNN localized the center of the CR in the eye image. Both CNNs achieved sub-pixel error. Image of co-recording setup adapted from [nystrom2023amplitude].
  • ...and 3 more figures
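
The selection logic described in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`detection_rate`, `choose_crop_center`, `peak`, `select_crs`) are hypothetical, the detection rate is assumed to count errors less than or equal to the threshold, the fallback crop center is assumed to be computed by the caller, and the exclusion criteria mentioned in Figure 3 are left out because the captions do not specify them.

```python
import numpy as np


def detection_rate(errors_px, threshold_px):
    """Fraction of images whose localization error is within the threshold.

    Assumption: an error equal to the threshold counts as a detection.
    """
    errors = np.asarray(errors_px, dtype=float)
    return float(np.mean(errors <= threshold_px))


def choose_crop_center(pure_center, pure_confidence, conf_threshold, fallback_center):
    """Adaptive cropping: trust PuRe's pupil center only when C > C_th.

    `fallback_center` stands in for the pupil predicted by the LEyes-trained
    model on a naive center crop (hypothetical input, computed elsewhere).
    """
    if pure_confidence > conf_threshold:
        return pure_center
    return fallback_center


def peak(heat_map):
    """Return (maximum value, (row, col)) of a 2-D heat map."""
    idx = np.unravel_index(np.argmax(heat_map), heat_map.shape)
    return float(heat_map[idx]), (int(idx[0]), int(idx[1]))


def select_crs(cr_maps, n_select=2):
    """Keep the `n_select` CR maps with the highest peak values.

    Returns a dict mapping the CR map index to its peak location.
    """
    peaks = [peak(m) for m in cr_maps]
    ranked = sorted(range(len(peaks)), key=lambda i: peaks[i][0], reverse=True)
    return {i: peaks[i][1] for i in sorted(ranked[:n_select])}


# Toy example: five CR maps, each with a single peak of a different height.
cr_maps = np.zeros((5, 64, 64))
for i, (value, loc) in enumerate(
    [(1.0, (5, 5)), (9.0, (10, 20)), (2.0, (30, 30)), (7.5, (40, 5)), (0.5, (60, 60))]
):
    cr_maps[i][loc] = value

selected = select_crs(cr_maps)  # maps 1 and 3 hold the two highest peaks
center = choose_crop_center((30, 42), 0.9, 0.5, (32, 32))
```

In this toy example the two tallest peaks sit in maps 1 and 3, so only those two CRs are kept, mirroring the asterisks in Figure 3. A real pipeline would additionally apply the exclusion criteria before accepting the image.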