Evaluating Image-Based Face and Eye Tracking with Event Cameras

Khadija Iddrisu; Waseem Shariff; Noel E. O'Connor; Joseph Lemley; Suzanne Little

TL;DR

This paper addresses the challenge of leveraging event cameras for face and eye tracking by converting asynchronous events $(x, y, t, p)$ into frame-based inputs using Temporal Binary Representation over a window $\Delta t$, enabling the use of standard CNN detectors. It constructs a synthetic event dataset by simulating 6-DOF planar motion on the Helen dataset via PlanarMotionStream and evaluates two detectors, GR-YOLOv3 and YOLOv8, against voxel-based baselines and real-event datasets (FES, Ryan). Results show frame-based representations achieve competitive performance, with YOLOv8 delivering strong mAP improvements on both synthetic and real data (e.g., up to $\mathrm{mAP} \approx 0.97$ on real data) and eye-detection robustness across lighting conditions. The study demonstrates the practicality of bridging event-camera advantages (low latency, high dynamic range) with CNN-based tracking, offering a computationally efficient pathway for real-time face and eye tracking in neuromorphic vision systems.

Abstract

Event cameras, also known as neuromorphic sensors, capture changes in local light intensity at the pixel level, producing asynchronously generated data termed "events". This distinct data format mitigates common issues observed in conventional cameras, such as under-sampling when capturing fast-moving objects, thereby preserving critical information that might otherwise be lost. However, leveraging this data often requires specialized, handcrafted event representations that account for the unique attributes of event data and integrate with conventional Convolutional Neural Networks (CNNs). In this study, we evaluate event-based face and eye tracking. The core objective is to show the viability of integrating conventional algorithms with event-based data transformed into a frame format, while preserving the unique benefits of event cameras. To validate our approach, we constructed a frame-based event dataset by simulating events between RGB frames derived from the publicly accessible Helen dataset. We assess its utility for face and eye detection by applying GR-YOLO, a pioneering technique derived from YOLOv3, and compare the results with those obtained by training YOLOv8 on the same dataset. The trained models were then tested on real event streams from various iterations of Prophesee's event cameras and further evaluated on the Faces in Event Stream (FES) benchmark dataset. The models trained on our dataset show good prediction performance across all the validation datasets, with a best mean Average Precision (mAP) score of 0.91. Additionally, the trained models demonstrated robust performance on real event camera data under varying light conditions.

Paper Structure (16 sections, 5 figures, 3 tables)

Introduction
Literature Review
Face Detection and Tracking
Eye tracking
Dataset
Event Representation
Network Architecture
GR-YOLOv3
YOLOv8
Training
Experiments and Evaluation
Quantitative Results: Synthetic Data Evaluation
Quantitative Results: Evaluation on Real Event Camera Data
Qualitative Results
Conclusions
...and 1 more section

Figures (5)

Figure 1: Overview of our proposed methodology
Figure 2: A sample video showcasing motion derived from an RGB image, transformed into events and then rebuilt into an event frame. From the left: original RGB image, 3 frames showing generated motion and compiled event frame.
Figure 3: Event Representation Procedure: The value in position $(x, y)$ is obtained as $b^i_{x,y} = \mathbf{1}(x, y)$, where $\mathbf{1}(x, y)$ is an indicator function returning 1 if an event is present in position $(x, y)$ and 0 otherwise (image from Innocenti et al., 2021)
Figure 4: Prediction Performance of GR-YOLOv3.
Figure 5: Prediction Performance of YOLOv8.
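
The indicator function in the Figure 3 caption can be illustrated with a short sketch. The code below is not the authors' implementation: the function names, the `(x, y, t, p)` array layout, and the bit-packing step that merges several binary frames into one (with its bit order and normalization) are assumptions made for illustration, loosely following the general idea of Temporal Binary Representation.

```python
import numpy as np

def binary_frames(events, height, width, t0, delta_t, n_bins):
    """Build n_bins binary frames, one per window of length delta_t.

    events: array of shape (K, 4) with columns x, y, t, p (assumed layout).
    frames[i, y, x] is 1 if any event fell on pixel (x, y) in window i.
    """
    frames = np.zeros((n_bins, height, width), dtype=np.uint8)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    # Index of the window each event timestamp falls into.
    idx = np.floor((events[:, 2] - t0) / delta_t).astype(int)
    keep = (idx >= 0) & (idx < n_bins)
    # Polarity is ignored: only the presence of an event matters.
    frames[idx[keep], y[keep], x[keep]] = 1
    return frames

def pack_frames(frames):
    """Merge binary frames into one frame by reading each pixel's bits
    as a binary number (assumed bit order: window 0 is least significant),
    then scale the result to [0, 1]."""
    n = frames.shape[0]
    weights = (2 ** np.arange(n)).reshape(n, 1, 1)
    packed = (frames.astype(np.int64) * weights).sum(axis=0)
    return packed / (2 ** n - 1)

# Three toy events on a 2x2 sensor, two windows of one time unit each.
events = np.array([[0, 0, 0.5, 1], [1, 0, 1.5, -1], [0, 0, 1.2, 1]])
frames = binary_frames(events, height=2, width=2, t0=0.0, delta_t=1.0, n_bins=2)
frame = pack_frames(frames)
```

The packed frame is an ordinary single-channel image, so it could in principle be passed to a standard CNN detector, which is the property the frame-based approach relies on.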
