A Lightweight Spatiotemporal Network for Online Eye Tracking with Event Camera

Yan Ru Pei; Sasskia Brüers; Sébastien Crouzet; Douglas McLelland; Olivier Coenen

TL;DR

This work proposes a causal spatiotemporal convolutional network built from a deliberately simple architecture and set of operations, together with an event-level affine augmentation strategy that alleviates the problem of dataset scarcity for event-based systems.

Abstract

Event-based data are commonly encountered in edge computing environments where efficiency and low latency are critical. To interface with such data and leverage their rich temporal features, we propose a causal spatiotemporal convolutional network. This solution targets efficient implementation on edge-appropriate hardware with limited resources in three ways: 1) it deliberately uses a simple architecture and set of operations (convolutions, ReLU activations); 2) it can be configured to perform online inference efficiently by buffering layer outputs; 3) it can achieve more than 90% activation sparsity through regularization during training, enabling significant efficiency gains on event-based processors. In addition, we propose a general affine augmentation strategy acting directly on the events, which alleviates the problem of dataset scarcity for event-based systems. We apply our model to the AIS 2024 event-based eye tracking challenge, reaching a p10 accuracy of 0.9916 on the Kaggle private test set.
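As an illustration of what an affine augmentation "acting directly on the events" could look like, here is a minimal sketch in plain Python. It is not the authors' implementation: the event layout `(x, y, t, polarity)`, the helper names `make_affine`, `apply_affine`, and `transform_point`, and the choice to discard events that leave the sensor area are all assumptions made for the example.

```python
import math
from typing import List, Tuple

# Assumed event layout: (x, y, t, polarity). The paper may store events differently.
Event = Tuple[float, float, float, int]
Affine = Tuple[Tuple[float, float, float], Tuple[float, float, float]]


def make_affine(angle_deg: float, scale: float, tx: float, ty: float,
                cx: float, cy: float) -> Affine:
    """Build a 2x3 matrix that rotates and scales about (cx, cy), then translates."""
    a = math.radians(angle_deg)
    c, s = scale * math.cos(a), scale * math.sin(a)
    # x' = c*(x - cx) - s*(y - cy) + cx + tx
    # y' = s*(x - cx) + c*(y - cy) + cy + ty
    return ((c, -s, cx + tx - c * cx + s * cy),
            (s, c, cy + ty - s * cx - c * cy))


def transform_point(m: Affine, x: float, y: float) -> Tuple[float, float]:
    """Apply the affine map to one coordinate pair (an event or a pupil label)."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def apply_affine(events: List[Event], m: Affine,
                 width: int, height: int) -> List[Event]:
    """Transform event coordinates; timestamps and polarities are left untouched."""
    out = []
    for x, y, t, p in events:
        xn, yn = transform_point(m, x, y)
        # Assumption: events mapped outside the sensor area are discarded.
        if 0 <= xn < width and 0 <= yn < height:
            out.append((xn, yn, t, p))
    return out
```

In a training pipeline the same matrix would also be applied to the ground-truth pupil coordinates via `transform_point`, so that events and labels stay aligned after augmentation.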

Paper Structure (21 sections, 6 equations, 3 figures, 3 tables)

Introduction
Related Work
Event binning methods
Spatiotemporal networks
Lightweight detector heads
Event data processing
Event volume binning
Causal Event volume binning
Spatial affine transformation of events
Temporal affine transformation of events
Network Architecture
Causal spatiotemporal convolutional block
Causal group normalization
Configuration for online inference with FIFO buffers
The detector head and loss
...and 6 more sections

Figures (3)

Figure 1: A. A lightweight spatiotemporal architecture for efficient eye tracking. The backbone is composed of 5 successive spatiotemporal blocks, each consisting of a temporal convolution followed by a spatial convolution. B. The model can be configured to run in streaming inference mode by using an input FIFO buffer for each temporal layer. The sliding window of the FIFO buffer plays the role of the convolution's sliding window, so the convolution operation itself reduces to a dot product between the elements in the FIFO buffer and the kernel weights. C. A comparison of direct binning, event volume binning, and causal event volume binning. The last method retains temporal information while remaining fully causal.
Figure 2: The average distance deviation with respect to batch size for networks using only causal group norms, only batch norms, and a mixture of causal group norms and batch norms.
Figure 3: Default parameters in the benchmark network: spatial downsampling factor of 5, 6 depthwise-separable (DWS) layers, and no activity regularization. A. and B. The average distance deviation of the model predictions vs. the MACs per inference frame, as the input spatial downsampling factor and number of DWS layers are varied respectively. C. The average distance deviation vs. the weighting of the $L_1$ regularization loss. D. The temporally averaged MACs per inference frame vs. the regularization weighting for a sparsity-aware and a sparsity-blind processor.
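The streaming configuration described in the Figure 1B caption can be sketched for a single scalar channel as follows. This is a simplified illustration rather than the paper's code: the names `causal_conv1d` and `StreamingTemporalLayer` are hypothetical, the buffer is zero-initialized (which matches left zero-padding in the offline case), and the kernel follows the cross-correlation convention in which the last weight multiplies the newest sample.

```python
from collections import deque
from typing import List, Sequence


def causal_conv1d(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Offline causal convolution: output[t] depends only on signal[0..t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(signal)  # left zero-padding keeps it causal
    return [sum(w * v for w, v in zip(kernel, padded[t:t + k]))
            for t in range(len(signal))]


class StreamingTemporalLayer:
    """One temporal layer in streaming mode: a FIFO buffer plus a dot product."""

    def __init__(self, kernel: Sequence[float]) -> None:
        self.kernel = list(kernel)
        # The FIFO holds the last len(kernel) inputs, ordered oldest to newest.
        self.buffer = deque([0.0] * len(self.kernel), maxlen=len(self.kernel))

    def step(self, x: float) -> float:
        self.buffer.append(x)  # the oldest element is dropped automatically
        return sum(w * v for w, v in zip(self.kernel, self.buffer))


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [0.5, -1.0, 2.0]
layer = StreamingTemporalLayer(kernel)
streamed = [layer.step(x) for x in signal]
offline = causal_conv1d(signal, kernel)
```

Feeding the samples one at a time gives the same outputs as the offline causal convolution, so each new frame costs one dot product per temporal layer instead of a recomputation over the whole window history.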
