Retina : Low-Power Eye Tracking with Event Camera and Spiking Hardware
Pietro Bonazzi, Sizhen Bian, Giovanni Lippolis, Yawei Li, Sadique Sheik, Michele Magno
TL;DR
This paper presents Retina, a neuromorphic eye-tracking solution that processes pure event data from a Dynamic Vision Sensor (DVS) using a directly trained Spiking Neural Network (SNN) deployed on the Speck edge processor. It introduces Ini-30, an event-based pupil dataset captured with glass-mounted DVS cameras from 30 volunteers, and reports end-to-end measurements on the processor: power of approximately $2.89$ to $4.8$ mW and latency of approximately $5.57$ to $8.01$ ms. Retina achieves a pupil-centroid error of about $3.24$ px on a $64\times64$ DVS input while requiring far fewer MACs ($3.03$M) and parameters ($63$k) than prior methods such as 3ET. Combining a lightweight SNN with a temporal weighted-sum regression and end-to-end neuromorphic deployment yields a competitive, energy-efficient, event-based eye-tracking pipeline suited to wearable edge devices and real-world use.
Abstract
This paper introduces a neuromorphic methodology for eye tracking, harnessing pure event data captured by a Dynamic Vision Sensor (DVS) camera. The framework integrates a directly trained Spiking Neural Network (SNN) regression model and leverages a state-of-the-art low-power edge neuromorphic processor, Speck, collectively aiming to advance the precision and efficiency of eye-tracking systems. First, we introduce a representative event-based eye-tracking dataset, "Ini-30", which was collected with two glass-mounted DVS cameras from thirty volunteers. Then, an SNN model based on Integrate-and-Fire (IAF) neurons, named "Retina", is described, featuring only 64k parameters (6.63x fewer than the latest method) and achieving a pupil tracking error of only 3.24 pixels on a 64x64 DVS input. The continuous regression output is obtained by convolution, using a non-spiking temporal 1D filter slid across the output spiking layer. Finally, we evaluate Retina on the neuromorphic processor, showing an end-to-end power of 2.89-4.8 mW and a latency of 5.57-8.01 ms, depending on the time window. We also benchmark our model against the latest event-based eye-tracking method, "3ET", which was built upon event frames. Results show that Retina achieves superior precision, with 1.24 px less pupil centroid error, and reduced computational complexity, with 35 times fewer MAC operations. We hope this work will open avenues for further investigation of closed-loop neuromorphic solutions and true event-based training pursuing edge performance.
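The two mechanisms named in the abstract, IAF neurons and a temporal 1D filter that turns output spikes into a continuous value, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names `iaf_spikes` and `temporal_weighted_sum` are hypothetical, the reset-by-subtraction rule is an assumption, and the filter weights here are hand-picked, whereas in a trained model they would be learned.

```python
from typing import List, Sequence


def iaf_spikes(currents: Sequence[float], threshold: float = 1.0) -> List[int]:
    """Integrate-and-fire neuron with no leak.

    The membrane potential accumulates the input at every time step and
    emits a spike (1) whenever it reaches the threshold.
    """
    membrane = 0.0
    spikes = []
    for current in currents:
        membrane += current
        if membrane >= threshold:
            spikes.append(1)
            membrane -= threshold  # assumption: reset by subtraction
        else:
            spikes.append(0)
    return spikes


def temporal_weighted_sum(
    spikes: Sequence[int], weights: Sequence[float]
) -> List[float]:
    """Slide a non-spiking 1D filter along the time axis of one spike train.

    Each output is the weighted sum of the spikes inside one window, so a
    binary spike train becomes a continuous-valued signal. Only windows
    that fit entirely inside the spike train are evaluated.
    """
    k = len(weights)
    if k == 0 or k > len(spikes):
        raise ValueError("filter length must be between 1 and len(spikes)")
    return [
        sum(w * s for w, s in zip(weights, spikes[t:t + k]))
        for t in range(len(spikes) - k + 1)
    ]
```

For example, `temporal_weighted_sum(iaf_spikes([0.5] * 8), [0.25, 0.25, 0.5])` first converts a constant input into a regular spike train and then reads out one continuous value per three-step window. In a regression setting, one such filtered channel per coordinate would give the continuous pupil position; that mapping is an assumption for illustration rather than a detail stated in the abstract.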
