FACET: Fast and Accurate Event-Based Eye Tracking Using Ellipse Modeling for Extended Reality

Junyuan Ding, Ziteng Wang, Chang Gao, Min Liu, Qinyu Chen

TL;DR

FACET addresses the challenge of high-precision, low-latency eye tracking in XR by leveraging event camera data to directly predict pupil ellipse parameters with an end-to-end network. It introduces fast causal event volume, fixed-count event binning, and a trigonometric loss to robustly learn ellipse orientation, integrated into a MobileNetV3 backbone with a DSC-based FPN and four heads. The authors augment the EV-Eye dataset with ellipse-based annotations via semi-supervised labeling, achieving 0.2030-pixel pupil center error and 0.530 ms inference time, outperforming prior methods while using far fewer parameters and operations. This work demonstrates that pure event-based, ellipse-focused tracking can meet XR requirements, enabling efficient, high-frequency gaze estimation for next-generation head-mounted displays.

Abstract

Eye tracking is a key technology for gaze-based interactions in Extended Reality (XR), but traditional frame-based systems struggle to meet XR's demands for high accuracy, low latency, and power efficiency. Event cameras offer a promising alternative due to their high temporal resolution and low power consumption. In this paper, we present FACET (Fast and Accurate Event-based Eye Tracking), an end-to-end neural network that directly outputs pupil ellipse parameters from event data, optimized for real-time XR applications. The ellipse output can be directly used in subsequent ellipse-based pupil trackers. To train the model, we enhance the EV-Eye dataset by expanding the annotated data and converting the original mask labels to ellipse-based annotations. In addition, we adopt a novel trigonometric loss to address angle discontinuities and propose a fast causal event volume representation for events. On the enhanced EV-Eye test set, FACET achieves an average pupil center error of 0.20 pixels and an inference time of 0.53 ms, reducing pixel error and inference time by 1.6$\times$ and 1.8$\times$, respectively, compared to the prior art, EV-Eye, with 4.4$\times$ fewer parameters and 11.7$\times$ fewer arithmetic operations. The code is available at https://github.com/DeanJY/FACET.

Paper Structure (23 sections, 5 equations, 5 figures, 3 tables, 1 algorithm)

Figures (5)

  • Figure 1: Flowchart for expanding the EV-Eye dataset and annotating it with ellipse labels. We first trained a U-Net segmentation network on over 9,000 frames with mask labels, enabling it to generate masks for the remaining unlabeled frames. Then, we fitted an ellipse to each mask to obtain five parameters $(x, y, a, b, \theta)$. Finally, we annotated the events corresponding to these frames with the ellipse labels derived from the U-Net masks to produce more annotated event data.
  • Figure 2: Flowchart of FACET. Event Processing: Input events are converted to a frame-like format using fixed count binning, fast causal event volume, and augmentation for training. Network: A MobileNetV3 backbone with FPN and DSC extracts and fuses features, which are then passed to four heads. Loss: Our total loss function includes several components, among which the customized trigonometric loss $L_T$ plays a crucial role. The term $L_T$ specifically addresses discontinuities in angle prediction, effectively measuring the difference between the predicted ellipse and the ground truth when combined with other losses. Detected Pupil: FACET directly generates ellipses end-to-end, unlike segmentation networks that first obtain a mask and then fit an ellipse. Subsequent Tracking: This direct ellipse generation lays the foundation for high-frequency event-based eye-tracking methods (li2023track; zhaoEVEyeRethinkingHighfrequency2023).
  • Figure 3: Examples of different event accumulation methods: (a) Event Volume, (b) Causal Event Volume, (c) Fast Causal Event Volume. Consider accumulating events at the 3.0 ms timestamp, where three events $e_1, e_2, e_3$ with positive polarity occur at 2.2 ms, 2.7 ms, and 2.9 ms, respectively. Since event volume (a) does not have temporal causality, these events will also affect the result at 2.0 ms, meaning that future events influence past timestamps. In the causal event volume example (b), temporal causality is preserved, and all events within the time window are processed. Our proposed fast causal event volume example (c) introduces a limit $l=0.5$ to optimize the accumulation. This reduces the contribution of earlier events (like $e_1$), speeding up the process for real-time inference, where only $e_1, e_2$ contribute based on the defined limit.
  • Figure 4: Examples of ellipses at different angles. (a) $\theta_a=179^\circ$, (b) $\theta_b=1^\circ$, and (c) $\theta_c=90^\circ$. Although $179^\circ$ and $1^\circ$ differ numerically, they produce ellipses more similar to each other than to $90^\circ$, implying their corresponding loss should reflect this pattern.
  • Figure 5: Visual comparison of E-Track, TennSt, and our FACET in four typical scenarios.
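
The angle discontinuity illustrated in Figure 4 can be made concrete with a small sketch. An ellipse's orientation repeats every $180^\circ$, so a plain absolute difference treats $179^\circ$ and $1^\circ$ as far apart, while a periodic measure such as $|\sin(\theta_a - \theta_b)|$ treats them as close. The function names below are hypothetical, and this periodic measure is only one illustrative choice. It is not claimed to be the paper's trigonometric loss $L_T$, whose exact form is not given in this section.

```python
import math


def naive_angle_error(theta_a_deg: float, theta_b_deg: float) -> float:
    """Plain absolute difference in degrees; ignores the 180-degree periodicity."""
    return abs(theta_a_deg - theta_b_deg)


def periodic_angle_error(theta_a_deg: float, theta_b_deg: float) -> float:
    """Illustrative periodic measure (an assumption, not the paper's L_T).

    |sin(delta)| is 0 when the two orientations coincide modulo 180 degrees
    and 1 when the ellipses are perpendicular to each other.
    """
    delta = math.radians(theta_a_deg - theta_b_deg)
    return abs(math.sin(delta))


if __name__ == "__main__":
    # The three angle pairs from Figure 4: 179 vs 1, 179 vs 90, 1 vs 90 degrees.
    for a, b in [(179.0, 1.0), (179.0, 90.0), (1.0, 90.0)]:
        print(f"{a:>5} vs {b:>4}: naive={naive_angle_error(a, b):6.1f}  "
              f"periodic={periodic_angle_error(a, b):.4f}")
```

Under the naive measure the pair $(179^\circ, 1^\circ)$ has the largest error of the three, whereas under the periodic measure it has the smallest, which matches the visual similarity described in the caption.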