3ET: Efficient Event-based Eye Tracking using a Change-Based ConvLSTM Network

Qinyu Chen, Zuowen Wang, Shih-Chii Liu, Chang Gao

TL;DR

This work tackles real-time pupil tracking from sparse event streams for wearable devices. It introduces the Change-Based ConvLSTM (CB-ConvLSTM), which creates temporal sparsity by feeding the thresholded hidden-state change ΔH_{t-1}, defined formally in the paper, into the gate computations. On a synthetic DVS version of the LPW pupil dataset, the method reaches roughly 85.3% temporal sparsity and a 4.7× reduction in arithmetic operations while maintaining accuracy, and it outperforms CNN baselines by over 30%. The approach is well suited to low-power AR/VR headsets and can benefit from hardware that exploits spatio-temporal sparsity. Code and data are publicly available.
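As a rough illustration of the thresholded hidden-change idea, the NumPy sketch below follows the general delta-network recipe: a change is propagated only where it exceeds a threshold, and a reference copy of the state is updated only at those positions. The names (`delta_threshold`, `run_delta_demo`, `theta`, `h_ref`), the dense matrix `W` standing in for a convolution kernel, and the random-walk hidden state are all assumptions made for illustration. This is not the paper's exact ΔH_{t-1} definition or its released code.

```python
import numpy as np


def delta_threshold(h, h_ref, theta):
    """Thresholded change of h relative to the reference state h_ref.

    Returns (delta, new_ref). Entries whose change is at most theta are
    zeroed, and the reference is only updated where a change was kept.
    """
    diff = h - h_ref
    keep = np.abs(diff) > theta
    delta = np.where(keep, diff, 0.0)
    new_ref = np.where(keep, h, h_ref)
    return delta, new_ref


def run_delta_demo(theta, steps=50, size=8, seed=0):
    """Run a toy recurrence and report the fraction of zero delta entries."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((size, size))  # hypothetical stand-in for a conv kernel
    h = rng.standard_normal(size)          # toy hidden state
    h_ref = np.zeros(size)                 # last propagated hidden state
    m = np.zeros(size)                     # running value of W @ h_ref
    zeros = 0
    for _ in range(steps):
        h = h + 0.05 * rng.standard_normal(size)  # slowly varying state
        delta, h_ref = delta_threshold(h, h_ref, theta)
        m += W @ delta                     # zero entries of delta contribute nothing
        zeros += int(np.count_nonzero(delta == 0))
    sparsity = zeros / (steps * size)
    return W, h, h_ref, m, sparsity


W, h, h_ref, m, sparsity = run_delta_demo(theta=0.1)
print(f"fraction of zero delta entries: {sparsity:.2f}")
```

Because the recurrent operator is linear, accumulating `W @ delta` keeps `m` equal to `W @ h_ref`, so zero entries of `delta` can be skipped without changing that product. A larger `theta` yields more zeros at the cost of a per-element approximation error of at most `theta`.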

Abstract

This paper presents a sparse Change-Based Convolutional Long Short-Term Memory (CB-ConvLSTM) model for event-based eye tracking, key for next-generation wearable healthcare technology such as AR/VR headsets. We leverage the benefits of retina-inspired event cameras, namely their low-latency response and sparse output event stream, over traditional frame-based cameras. Our CB-ConvLSTM architecture efficiently extracts spatio-temporal features for pupil tracking from the event stream, outperforming conventional CNN structures. Utilizing a delta-encoded recurrent path enhancing activation sparsity, CB-ConvLSTM reduces arithmetic operations by approximately 4.7× without losing accuracy when tested on a v2e-generated event dataset of labeled pupils. This increase in efficiency makes it ideal for real-time eye tracking in resource-constrained devices. The project code and dataset are openly available at https://github.com/qinche106/cb-convlstm-eyetracking.

Paper Structure

This paper contains 7 sections, 4 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Comparison between frames and events for the same 53 ms of eye movement. A) Example video from the LPW dataset [tonsen16_etra]. Frames are sampled at a fixed frame rate (95 Hz); B) Using the v2e simulator [hu2021v2e], the video frames in A are converted to realistic synthetic DVS event streams. In this example, 5 frames of size 240×180 produce only 310 events.
  • Figure 2: A diverse set of images from the LPW dataset [tonsen16_etra]. The first row shows different eye appearances. The second row shows some difficult cases, e.g. eyelid occlusion, glasses occlusion, and heavy makeup.
  • Figure 3: A sequence of consecutive event frames in voxel-grid representation from the event-based LPW dataset, generated with the v2e DVS simulator.
  • Figure 4: The pupil tracking network using Change-Based ConvLSTM (CB-ConvLSTM) units on the event-based LPW dataset.
  • Figure 5: Detection rate under different sequence lengths.
  • ...and 1 more figure
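
The captions for Figures 1 and 3 mention a voxel-grid representation and an example in which 5 frames of 240×180 pixels correspond to only 310 events. The sketch below shows one simple way such a grid could be built and how sparse that example is. The function name `events_to_voxel_grid`, the polarity encoding (`p > 0` for ON events), the nearest-bin assignment without temporal interpolation, and the randomly generated events are assumptions for illustration; the paper's own preprocessing may differ.

```python
import numpy as np


def events_to_voxel_grid(x, y, t, p, num_bins, height, width):
    """Accumulate signed polarities into a (num_bins, height, width) grid.

    Assumptions: x, y are integer pixel coordinates, t holds timestamps,
    p > 0 marks ON events, and each event falls into exactly one temporal
    bin (no interpolation between neighbouring bins).
    """
    grid = np.zeros((num_bins, height, width))
    if len(t) == 0:
        return grid
    t = np.asarray(t, dtype=float)
    span = max(t.max() - t.min(), 1e-9)        # avoid division by zero
    bins = ((t - t.min()) / span * num_bins).astype(int)
    bins = np.minimum(bins, num_bins - 1)      # the latest event goes in the last bin
    signs = np.where(np.asarray(p) > 0, 1.0, -1.0)
    # np.add.at accumulates correctly when several events hit the same cell
    np.add.at(grid, (bins, np.asarray(y), np.asarray(x)), signs)
    return grid


# Synthetic stand-in for the Figure 1 example: 310 random events over 53 ms.
rng = np.random.default_rng(0)
n, height, width = 310, 180, 240
num_frames = 5                                 # frames covered by the same 53 ms
num_bins = 5                                   # hypothetical number of temporal bins
x = rng.integers(0, width, n)
y = rng.integers(0, height, n)
t = np.sort(rng.uniform(0.0, 0.053, n))
p = rng.integers(0, 2, n)

grid = events_to_voxel_grid(x, y, t, p, num_bins, height, width)
event_density = n / (num_frames * height * width)   # 310 / 216000
print(grid.shape, f"{event_density:.4%}")
```

Five 240×180 frames hold 216,000 pixel values, so 310 events amount to roughly 0.14% of that count, which is the kind of input sparsity an event-based pipeline can exploit.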