Using Deep Learning to Increase Eye-Tracking Robustness, Accuracy, and Precision in Virtual Reality

Kevin Barkevich, Reynold Bailey, Gabriel J. Diaz

TL;DR

This paper evaluates how contemporary ML-based eye feature segmentation networks influence gaze estimation quality in VR, comparing RITnet, EllSegGen, and ESFnet as preprocessing steps and as direct detectors against a native Pupil Labs detector, using both feature-based and 3D model-based gaze mappings. By deploying an open-source evaluation pipeline on VR eye-tracking data, it quantifies dropout rate, accuracy, and precision across two image resolutions and multiple gaze-estimation strategies. The findings show that well-performing segmentation models can reduce data dropouts and improve precision without sacrificing accuracy, with EllSegGen and ESFnet often delivering the strongest benefits, especially at 400×400 pixel resolution. The work provides practical guidelines for selecting pupil-detection networks in mobile VR and establishes an open framework for future, potentially real-time, ML-based eye-tracking improvements.

Abstract

Algorithms for the estimation of gaze direction from mobile and video-based eye trackers typically involve tracking a feature of the eye that moves through the eye camera image in a way that covaries with the shifting gaze direction, such as the center or boundaries of the pupil. Tracking these features using traditional computer vision techniques can be difficult due to partial occlusion and environmental reflections. Although recent efforts to use machine learning (ML) for pupil tracking have demonstrated superior results when evaluated using standard measures of segmentation performance, little is known of how these networks may affect the quality of the final gaze estimate. This work provides an objective assessment of the impact of several contemporary ML-based methods for eye feature tracking when the subsequent gaze estimate is produced using either feature-based or model-based methods. Metrics include the accuracy and precision of the gaze estimate, as well as drop-out rate.

Paper Structure (18 sections, 2 equations, 7 figures, 3 tables)

Figures (7)

  • Figure 1: The pipeline through which our experiment data is processed, starting from the input eye images/video frames and feeding into our analysis of the gaze estimations. By default the eye images are not preprocessed by a neural network. We aim to explore the impact of using various neural network-based feature detection techniques in the preprocessing phase. We also aim to explore the impact of several of these neural networks when used to directly output pupil locations, bypassing the Pupil Labs pupil detector.
  • Figure 2: Comparison of different semantic segmentation techniques applied to 192×192 px eye images captured during our VR data collection sessions. Top: original images with Pupil Labs' default pupil prediction (orange). Rows 2, 3, 4: semantic segmentation results for the neural network-based techniques RITnet, EllSegGen, and ESFnet, respectively. EllSegGen and ESFnet are also capable of directly predicting the ellipse parameters that encapsulate the pupil (orange) and iris (light green).
  • Figure 3: Pupil Labs HTC Vive Add-On, consisting of two infrared eye cameras and LEDs that fit inside the eye cavity of the HTC Vive Pro VR headset.
  • Figure 4: A comparison of dropout thresholds across all data used in the analysis portion of this experiment. The plot shows the percentage of data retained at each dropout threshold. Because the curve levels out at around 10 degrees, we set the dropout threshold at 10 degrees.
  • Figure 5: Dropout rate (left), accuracy error (center), and precision error (right) across fixation point eccentricities for the 192×192 px eye data collected from the feature-based (top) and 3D model-based (bottom) gaze estimation algorithms. Samples above the dropout threshold of 10$^{\circ}$ were omitted from calculations of accuracy and precision. Shading represents 95% confidence intervals for the mean. The range of the Y axis was chosen to provide insight into the performance of the best-performing algorithms, with the consequence that the errors for RITnet and EllSegGen (Direct Iris) fall beyond that range in some graphs. This data is presented in Tables \ref{tab:robustness}, \ref{tab:accuracy}, and \ref{tab:precision}.
  • ...and 2 more figures
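
The three metrics named in the abstract and in the captions of Figures 4 and 5 can be illustrated with a minimal Python sketch. It is not the authors' pipeline, and it rests on several assumptions that the text above does not confirm: gaze and target directions are 3D vectors, a frame with no pupil detection is represented as `None` and counted as a dropout, accuracy is the mean angular offset from the target, and precision is the RMS angular distance between successive retained samples (a common convention). The function names are hypothetical; only the 10 degree threshold comes from the captions.

```python
import math

# Dropout threshold in degrees, as stated in the Figure 4 caption.
DROPOUT_THRESHOLD_DEG = 10.0


def angular_error_deg(a, b):
    """Angle in degrees between two 3D direction vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def summarize_fixation(gaze_samples, target, threshold_deg=DROPOUT_THRESHOLD_DEG):
    """Return (dropout_rate, accuracy_deg, precision_deg) for one fixation.

    gaze_samples: list of 3D gaze vectors, with None for frames that
    produced no gaze estimate (assumed representation of a missing sample).
    """
    if not gaze_samples:
        raise ValueError("at least one sample is required")

    # Retain samples that exist and lie within the dropout threshold.
    retained = [
        g for g in gaze_samples
        if g is not None and angular_error_deg(g, target) <= threshold_deg
    ]
    dropout_rate = 1.0 - len(retained) / len(gaze_samples)

    # Accuracy (assumed definition): mean angular offset from the target.
    accuracy = None
    if retained:
        errors = [angular_error_deg(g, target) for g in retained]
        accuracy = sum(errors) / len(errors)

    # Precision (assumed definition): RMS of the angular distance between
    # successive retained samples. Retained samples may not be adjacent
    # frames if a dropout occurred between them.
    precision = None
    if len(retained) >= 2:
        steps = [angular_error_deg(p, q) for p, q in zip(retained, retained[1:])]
        precision = math.sqrt(sum(s * s for s in steps) / len(steps))

    return dropout_rate, accuracy, precision


def direction_at(angle_deg):
    """Unit vector rotated angle_deg away from the +Z axis in the XZ plane."""
    r = math.radians(angle_deg)
    return (math.sin(r), 0.0, math.cos(r))


# Example: two good samples, one missing frame, one sample 20 degrees off.
samples = [direction_at(0.0), direction_at(2.0), None, direction_at(20.0)]
dropout, accuracy, precision = summarize_fixation(samples, direction_at(0.0))
```

In this example, half of the samples are dropped (the missing frame and the 20 degree outlier), so accuracy and precision are computed from the two retained samples only, mirroring the caption's statement that samples above the threshold are omitted from both calculations.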