EETnet: a CNN for Gaze Detection and Tracking for Smart-Eyewear

Andrea Aspesi, Andrea Simpsi, Aaron Tognoli, Simone Mentasti, Luca Merigo, Matteo Matteucci

TL;DR

EETnet addresses the challenge of real-time eye tracking on embedded devices by using a compact CNN that processes purely event-based eye data. It provides two output modes (regression for pupil coordinates and grid-based classification for position within a frame), both trained on 200 Hz event frames with careful ROI alignment and semi-automatic ground-truth annotation. Through quantization-aware training, the model is deployed on diverse microcontrollers, with the MAX78000 delivering sub-3 ms inferences at under 1 mJ per inference, demonstrating practical viability for battery-powered smart eyewear. The approach combines dataset preprocessing, architecture optimization, and hardware-aware deployment to enable low-latency, energy-efficient gaze tracking in wearables.
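The figures quoted above can be turned into a rough real-time budget. The sketch below assumes one inference per 200 Hz event frame, which the summary does not state explicitly, and treats the "sub-3 ms" and "under 1 mJ" values as upper bounds; the variable names are illustrative only.

```python
# Rough real-time budget, assuming one inference per event frame (an assumption).
FRAME_RATE_HZ = 200        # event-frame rate quoted in the summary
LATENCY_BOUND_S = 3e-3     # "sub-3 ms" inference, used as an upper bound
ENERGY_BOUND_J = 1e-3      # "under 1 mJ" per inference, used as an upper bound

# Time available between two consecutive frames: 1 / 200 Hz = 5 ms.
frame_period_s = 1.0 / FRAME_RATE_HZ

# Fraction of each frame period spent on inference, at most 3 ms / 5 ms.
duty_cycle_bound = LATENCY_BOUND_S / frame_period_s

# Upper bound on average inference power: 1 mJ * 200 per second = 0.2 W.
avg_power_bound_w = ENERGY_BOUND_J * FRAME_RATE_HZ

print(f"frame period: {frame_period_s * 1e3:.1f} ms")
print(f"inference duty cycle: below {duty_cycle_bound:.0%}")
print(f"average inference power: below {avg_power_bound_w * 1e3:.0f} mW")
```

Under these assumptions, each inference finishes before the next frame arrives, which is what allows the tracker to keep up with the 200 Hz input.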

Abstract

Event-based cameras are becoming a popular solution for efficient, low-power eye tracking. Due to the sparse and asynchronous nature of event data, they require less processing power and offer latencies in the microsecond range. However, many existing solutions are limited to validation on powerful GPUs, with no deployment on real embedded devices. In this paper, we present EETnet, a convolutional neural network designed for eye tracking using purely event-based data, capable of running on microcontrollers with limited resources. Additionally, we outline a methodology to train, evaluate, and quantize the network using a public dataset. Finally, we propose two versions of the architecture: a classification model that detects the pupil on a grid superimposed on the original image, and a regression model that operates at the pixel level.
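To make the difference between the two output modes concrete, here is a minimal sketch of how a pixel-level pupil label could be converted into a grid-cell class and back. The frame size, the grid size, and the function names `pixel_to_cell` and `cell_to_pixel` are all assumptions chosen for illustration; the summary does not give the grid layout used by EETnet.

```python
def pixel_to_cell(x, y, width, height, grid_cols, grid_rows):
    """Map a pupil center in pixels to a row-major grid-cell class index."""
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("pupil center lies outside the frame")
    col = int(x * grid_cols / width)    # column of the cell containing x
    row = int(y * grid_rows / height)   # row of the cell containing y
    return row * grid_cols + col


def cell_to_pixel(index, width, height, grid_cols, grid_rows):
    """Return the center of a grid cell in pixels, i.e. the decoded prediction."""
    row, col = divmod(index, grid_cols)
    return ((col + 0.5) * width / grid_cols,
            (row + 0.5) * height / grid_rows)


# Hypothetical 64x64 frame with an 8x8 grid (not the paper's configuration).
label = pixel_to_cell(10, 20, 64, 64, 8, 8)
decoded = cell_to_pixel(label, 64, 64, 8, 8)
print(label, decoded)
```

A regression head would instead predict the two values `(x, y)` directly. The grid formulation trades resolution for a simpler target: the decoded position can differ from the true center by up to half a cell in each axis.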

Paper Structure

This paper contains 11 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of the EETnet pipeline. The network, designed to run on a microcontroller, processes accumulated event frames of an eye to detect pupil location using either a regression or a classification approach.
  • Figure 2: Heatmap of pupil center locations. The top image illustrates the original distribution, while the bottom image shows the distribution after augmentation.
  • Figure 3: Example of augmented frames. White pixels are positive events, blue pixels are negative events, and the yellow cross marks the labeled pupil center.
  • Figure 4: Schema of the EETnet regression model (top) and classification model (bottom). The only difference is the last layer, which has 577 neurons in the classification model instead of 2.
  • Figure 5: Comparison of energy consumption per inference, time per inference, and multiply-accumulate (MAC) operations per cycle on three microcontrollers: STM32U5, STM32H7, and MAX78000.
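The accumulated event frames mentioned in Figures 1 and 3 can be sketched as follows. This is an illustrative reconstruction, not the authors' preprocessing: the event tuple layout `(t_us, x, y, polarity)`, the function name `accumulate_events`, and the rule that the last event at a pixel overwrites earlier ones are all assumptions. The 5000 µs default window corresponds to the 200 Hz frame rate quoted in the TL;DR.

```python
def accumulate_events(events, width, height, t_start_us, window_us=5000):
    """Collapse events inside one time window into a signed 2D frame.

    Each event is a hypothetical tuple (t_us, x, y, polarity). A pixel holds
    +1 for a positive event, -1 for a negative event, and 0 if no event fell
    on it during the window.
    """
    frame = [[0] * width for _ in range(height)]
    t_end_us = t_start_us + window_us
    for t_us, x, y, polarity in events:
        in_window = t_start_us <= t_us < t_end_us
        in_frame = 0 <= x < width and 0 <= y < height
        if in_window and in_frame:
            # Assumed policy: a later event in the list overwrites earlier ones.
            frame[y][x] = 1 if polarity else -1
    return frame


# Toy events on a 4x3 sensor; the third one falls outside the 5 ms window.
events = [(0, 1, 1, 1), (100, 2, 0, 0), (6000, 0, 0, 1), (200, 1, 1, 0)]
frame = accumulate_events(events, width=4, height=3, t_start_us=0)
for row in frame:
    print(row)
```

Other accumulation policies (for example, counting events per pixel or keeping separate polarity channels) are equally plausible; the signed single-channel form is used here only because it matches the white/blue rendering described for Figure 3.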