GazeSCRNN: Event-based Near-eye Gaze Tracking using a Spiking Neural Network

Stijn Groenen, Marzieh Hassanshahi Varposhti, Mahyar Shahsavari

TL;DR

GazeSCRNN tackles high-temporal-resolution, energy-efficient near-eye gaze tracking by processing event streams from Dynamic Vision Sensor (DVS) cameras with a spiking convolutional recurrent network. The method combines Adaptive Leaky-Integrate-and-Fire (ALIF) neurons, convolutional feature extraction, and recurrent temporal integration, and is trained with surrogate gradients and Forward-Propagation-Through-Time (FPTT) to balance accuracy against memory use. Key contributions include the GazeSCRNN architecture and extensive ablations on event framing, neuron types, and training strategies. On the EV-Eye dataset, the most accurate model reaches a Mean Angle Error (MAE) of 6.034° and a Mean Pupil Error (MPE) of 2.094 mm. The work highlights both the promise of SNN-based event-driven gaze tracking and the practical challenges of ground-truth quality, framing choices, and real-time deployment on neuromorphic hardware.
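To make the neuron model named above more concrete, here is a minimal sketch of a generic discrete-time adaptive LIF neuron, in which each spike raises the firing threshold and the threshold then decays back to its baseline. This is not the paper's implementation: the function name `simulate_alif`, the update rule, and all parameter values (`alpha`, `rho`, `b0`, `beta`) are assumptions for illustration, and the time constants here are fixed rather than the liquid time constants the paper's neuron uses.

```python
def simulate_alif(inputs, alpha=0.9, rho=0.95, b0=1.0, beta=1.8):
    """Simulate one adaptive LIF neuron over a sequence of input currents.

    Hypothetical parameters (illustrative values only):
      alpha: membrane decay factor per time step
      rho:   decay factor of the threshold adaptation variable
      b0:    baseline firing threshold
      beta:  how strongly past spikes raise the threshold
    Returns the spike train and the threshold used at each step.
    """
    u = 0.0      # membrane potential
    eta = 0.0    # adaptation variable, driven by past spikes
    spike = 0    # spike emitted at the previous step
    spikes, thresholds = [], []
    for current in inputs:
        # Adaptation: decays over time, increases after a spike.
        eta = rho * eta + (1.0 - rho) * spike
        theta = b0 + beta * eta
        # Leaky integration with a soft reset after a spike.
        u = alpha * u + (1.0 - alpha) * current - spike * theta
        spike = 1 if u >= theta else 0
        spikes.append(spike)
        thresholds.append(theta)
    return spikes, thresholds


spikes, thresholds = simulate_alif([2.0] * 30)
```

Under a constant input the neuron fires, and the recorded threshold rises above `b0` after the first spike, which is the adaptive behaviour that distinguishes ALIF from plain LIF neurons.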

Abstract

This work introduces GazeSCRNN, a novel spiking convolutional recurrent neural network designed for event-based near-eye gaze tracking. Leveraging the high temporal resolution, energy efficiency, and compatibility of Dynamic Vision Sensor (DVS) cameras with event-based systems, GazeSCRNN uses a spiking neural network (SNN) to address the limitations of traditional gaze-tracking systems in capturing dynamic movements. The proposed model processes event streams from DVS cameras using Adaptive Leaky-Integrate-and-Fire (ALIF) neurons and a hybrid architecture optimized for spatio-temporal data. Extensive evaluations on the EV-Eye dataset demonstrate the model's accuracy in predicting gaze vectors. In addition, we conducted ablation studies to reveal the importance of the ALIF neurons, dynamic event framing, and training techniques, such as Forward-Propagation-Through-Time, in enhancing overall system performance. The most accurate model achieved a Mean Angle Error (MAE) of 6.034° and a Mean Pupil Error (MPE) of 2.094 mm. Consequently, this work is pioneering in demonstrating the feasibility of using SNNs for event-based gaze tracking, while shedding light on critical challenges and opportunities for further improvement.
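As a rough illustration of how an angle-based gaze metric such as the MAE can be computed, the sketch below averages the angle between predicted and ground-truth 3D gaze vectors. The exact definition used in the paper is not given in this summary, so the function names and the choice of 3D direction vectors as input are assumptions.

```python
import math


def angle_between_deg(u, v):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_theta = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


def mean_angle_error(predicted, target):
    """Mean angular error (degrees) over paired gaze vectors."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [angle_between_deg(p, t) for p, t in zip(predicted, target)]
    return sum(errors) / len(errors)


# One perfect prediction and one that is 90 degrees off.
example = mean_angle_error(
    [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)],
)
```

The vectors do not need to be normalized beforehand, because the cosine is divided by both norms.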

Paper Structure

This paper contains 19 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of the event-based gaze tracking pipeline. The events plot shows input events captured by a Dynamic Vision Sensor (DVS) camera. The frames plot shows the events aggregated into frames during data preprocessing. The spike plot shows frames processed by the spiking neural network, producing sequences of complex spike patterns. The gaze vectors plot shows output gaze vectors, indicating the user's point of focus.
  • Figure 2: Adaptive Leaky-Integrate-and-Fire Neuron with Liquid Time Constants
  • Figure 3: GazeSCRNN, a Spiking Convolutional Recurrent Neural Network architecture for event-based gaze tracking
  • Figure 4: Mean Angle Error (MAE) and Mean Pupil Error (MPE) vs. the number of Truncated Backpropagation-Through-Time Time Steps ($T$) for models trained with and without Forward-Propagation-Through-Time (FPTT)
  • Figure 5: Mean Angle Error (MAE) and Mean Pupil Error (MPE) vs. the number of events per aggregated frame in the input sequence
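The aggregation of events into frames mentioned in Figures 1 and 5 can be sketched as follows, assuming each frame collects a fixed number of events and counts them per pixel in two polarity channels. The event tuple layout `(x, y, t, polarity)`, the two-channel count representation, and the function name `events_to_frames` are assumptions for illustration, not details taken from the paper.

```python
def events_to_frames(events, events_per_frame, width, height):
    """Group a time-ordered event list into count frames.

    Each event is assumed to be (x, y, t, polarity) with polarity > 0
    for ON events. Every frame is a nested list indexed as
    frame[channel][y][x], where channel 0 counts OFF events and
    channel 1 counts ON events. A trailing chunk with fewer than
    events_per_frame events is dropped.
    """
    frames = []
    last_start = len(events) - events_per_frame
    for start in range(0, last_start + 1, events_per_frame):
        frame = [[[0] * width for _ in range(height)] for _ in range(2)]
        for x, y, _t, polarity in events[start:start + events_per_frame]:
            channel = 1 if polarity > 0 else 0
            frame[channel][y][x] += 1
        frames.append(frame)
    return frames


# Five events on a hypothetical 2x2 sensor, two events per frame.
demo_events = [
    (0, 0, 0, 1),
    (1, 0, 1, 0),
    (0, 0, 2, 1),
    (1, 1, 3, 1),
    (0, 1, 4, 0),
]
demo_frames = events_to_frames(demo_events, events_per_frame=2, width=2, height=2)
```

Because each frame holds the same number of events, frames cover short time spans during fast eye movements and longer spans when the eye is nearly still, so the number of events per frame acts as a tunable trade-off between temporal resolution and the information available in each frame.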