A Lightweight Spatiotemporal Network for Online Eye Tracking with Event Camera

Yan Ru Pei, Sasskia Brüers, Sébastien Crouzet, Douglas McLelland, Olivier Coenen

TL;DR

This work proposes a causal spatiotemporal convolutional network built from a deliberately simple architecture and set of operations, together with an affine augmentation strategy acting directly on the events that alleviates the problem of dataset scarcity for event-based systems.

Abstract

Event-based data are commonly encountered in edge computing environments where efficiency and low latency are critical. To interface with such data and leverage their rich temporal features, we propose a causal spatiotemporal convolutional network. This solution targets efficient implementation on edge-appropriate hardware with limited resources in three ways: 1) it deliberately targets a simple architecture and set of operations (convolutions, ReLU activations); 2) it can be configured to perform online inference efficiently via buffering of layer outputs; 3) it can achieve more than 90% activation sparsity through regularization during training, enabling significant efficiency gains on event-based processors. In addition, we propose a general affine augmentation strategy acting directly on the events, which alleviates the problem of dataset scarcity for event-based systems. We apply our model to the AIS 2024 event-based eye tracking challenge, reaching a p10 accuracy of 0.9916 on the Kaggle private test set.
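
To make the idea of an augmentation "acting directly on the events" concrete, the sketch below applies a single spatial affine map (rotation, isotropic scaling, translation) to raw event coordinates and to the pupil-center label, rather than to binned frames. This is a minimal illustration, not the authors' implementation: the function name `affine_augment_events`, the `(x, y)` array layout, and the choice to rotate about the sensor center are all assumptions, and the abstract does not say which affine parameters are sampled or whether time is also transformed.

```python
import numpy as np

def affine_augment_events(xy, center, angle_rad, scale, shift, width, height):
    """Warp event coordinates and the label with one spatial affine map.

    xy     : (N, 2) array of event (x, y) coordinates (hypothetical layout)
    center : (2,) pupil-center label in the same pixel coordinates
    Returns the warped coordinates, a mask of events still on the sensor,
    and the warped label. The mask can be reused to filter the matching
    timestamps and polarities.
    """
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    A = scale * np.array([[c, -s], [s, c]])          # rotation + scaling
    origin = np.array([width / 2.0, height / 2.0])   # assumed pivot point

    def warp(points):
        return (points - origin) @ A.T + origin + shift

    new_xy = warp(np.asarray(xy, dtype=np.float64))
    new_center = warp(np.asarray(center, dtype=np.float64))

    # Events pushed outside the sensor area are flagged for removal.
    keep = ((new_xy[:, 0] >= 0) & (new_xy[:, 0] < width)
            & (new_xy[:, 1] >= 0) & (new_xy[:, 1] < height))
    return new_xy, keep, new_center
```

Because the same map is applied to the events and to the label, the augmented sample remains consistent; binning into frames would then happen after the warp.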

Paper Structure (21 sections, 6 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: A. A lightweight spatiotemporal architecture for efficient eye tracking. The backbone is composed of a succession of 5 spatiotemporal blocks. Each spatiotemporal block consists of a temporal convolution followed by a spatial convolution. B. The model can be configured to run in streaming inference mode by using an input FIFO buffer for each temporal layer. The sliding-window mechanism of the FIFO buffer acts as the convolution sliding window, and the convolution operation itself is replaced by a dot product between the elements in the FIFO buffer and the kernel weights. C. Comparison of direct binning, event volume binning, and causal event volume binning. The last method retains temporal information while still being fully causal.
  • Figure 2: The average distance deviation with respect to batch size for networks using only causal group norms, only batch norms, and a mixture of causal group norms and batch norms.
  • Figure 3: Default parameters in the benchmark network: spatial downsampling factor of 5, 6 depthwise-separable (DWS) layers, and no activity regularization. A. and B. The average distance deviation of the model predictions vs. the MACs per inference frame, as the input spatial downsampling factor and the number of DWS layers are varied, respectively. C. The average distance deviation vs. the weighting of the $L_1$ regularization loss. D. The temporally averaged MACs per inference frame vs. the regularization weighting for a sparsity-aware and a sparsity-blind processor.
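
The streaming mode described in Figure 1B can be sketched for a single channel as follows. The example compares an offline causal temporal convolution over a whole sequence with a step-by-step version that keeps a FIFO buffer and computes one dot product per incoming sample. It is a simplified illustration under stated assumptions: the names `causal_conv_offline` and `StreamingTemporalConv` are hypothetical, the buffer is zero-initialized to match zero left-padding, and the real network applies this per feature map rather than to scalars.

```python
from collections import deque

import numpy as np

def causal_conv_offline(x, kernel):
    """Causal 1-D convolution over a full sequence (zero left-padding)."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), x])
    # Output at time t only sees inputs at times t-k+1 .. t.
    return np.array([padded[t:t + k] @ kernel for t in range(len(x))])

class StreamingTemporalConv:
    """Same layer in streaming mode: a FIFO buffer plus one dot product."""

    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=np.float64)
        k = len(self.kernel)
        # Zero-filled buffer plays the role of the left padding.
        self.buffer = deque([0.0] * k, maxlen=k)

    def step(self, sample):
        self.buffer.append(sample)  # the oldest element is dropped
        return float(np.dot(np.array(self.buffer), self.kernel))

x = np.array([1.0, 2.0, 3.0, 4.0])
kernel = np.array([0.5, -1.0, 2.0])
layer = StreamingTemporalConv(kernel)
streamed = np.array([layer.step(v) for v in x])
offline = causal_conv_offline(x, kernel)
```

Here `streamed` and `offline` agree element by element, which is the property that lets a network trained on full sequences run online one frame at a time without recomputing past activations.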