FAPNet: An Effective Frequency Adaptive Point-based Eye Tracker

Xiaopeng Lin, Hongwei Ren, Bojun Cheng

TL;DR

FAPNet tackles the need for ultra-fast, low-power eye tracking by processing raw event streams as Point Clouds, preserving fine-grained temporal information. It introduces a frequency-adaptive window and an Inter Sample LSTM to capture both short-term spatial-temporal and long-term sequential dependencies in a lightweight architecture. On SEET synthetic data, FAPNet achieves state-of-the-art performance with about 10% of PEPNet's FLOPs, and on real-world EET+ benchmarks maintains competitive accuracy while reducing computational load, with results largely independent of sensor resolution. The approach enables efficient, edge-friendly eye tracking suitable for near-eye applications in AR/VR and other domains.

Abstract

Eye tracking is crucial for human-computer interaction in many domains. Conventional cameras face challenges such as power consumption and image quality during different eye movements, prompting the need for ultra-fast, low-power, and accurate eye trackers. Event cameras, fundamentally designed to capture information about moving objects, exhibit low power consumption and high temporal resolution, which positions them as an alternative to traditional cameras for eye tracking. Nevertheless, existing event-based eye tracking networks neglect the sparse and fine-grained temporal information in events, resulting in unsatisfactory performance. Moreover, the energy efficiency of event cameras is further compromised by excessively complex models, hindering efficient deployment on edge devices. In this paper, we use Point Cloud as the event representation to harness the high temporal resolution and sparse characteristics of events in eye tracking tasks. We rethink the point-based architecture PEPNet with respect to preprocessing and the long-term relationships between samples, leading to the design of FAPNet. A frequency adaptive mechanism realizes adaptive tracking according to the speed of the pupil movement, and the Inter Sample LSTM module exploits the temporal correlation between samples. In the Event-based Eye Tracking Challenge, we use vanilla PEPNet, the earlier work, to achieve a $p_{10}$ accuracy of 97.95\%. On the SEET synthetic dataset, FAPNet achieves state-of-the-art performance while consuming merely 10\% of PEPNet's computational resources. Notably, the computational demand of FAPNet is independent of the sensor's spatial resolution, enhancing its applicability on resource-limited edge devices.
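The abstract reports a $p_{10}$ accuracy without defining it. A minimal sketch follows, assuming the usual reading of the metric: the fraction of predicted pupil centers that lie within 10 pixels (Euclidean distance) of the ground truth. The function name `p_accuracy` and its parameters are hypothetical and are not taken from the paper.

```python
import math


def p_accuracy(preds, targets, tol=10.0):
    """Fraction of predictions within `tol` pixels of the ground truth.

    Assumption: the error is the Euclidean distance between the predicted
    and the true (x, y) pupil center, and a prediction counts as correct
    when that distance is <= tol.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        raise ValueError("at least one prediction is required")

    hits = 0
    for (px, py), (tx, ty) in zip(preds, targets):
        # math.hypot gives the Euclidean norm of the error vector.
        if math.hypot(px - tx, py - ty) <= tol:
            hits += 1
    return hits / len(preds)
```

Under this reading, calling `p_accuracy(preds, targets, tol=10.0)` on a list of predicted and true centers yields a value between 0 and 1, and a reported 97.95\% corresponds to a return value of 0.9795.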

Paper Structure (23 sections, 3 equations, 8 figures, 1 table)

This paper contains 23 sections, 3 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Real data overview. The first row shows selected training samples and the second row shows selected testing samples. All visualized samples are generated by stacking the events within a 50 ms time interval.
  • Figure 2: Event density. The x-axis is the time in ms and the y-axis is the event count of each 10 ms sample. The red arrow indicates the dynamic window length of each sample.
  • Figure 3: Correlation between event density and errors. The x-axis is the event size level and the y-axis is the number of samples with errors above 3 pixels at each level. A higher event size level means more events in each sample. The blue line is fixed-frequency eye tracking and the red line is adaptive-frequency tracking.
  • Figure 4: FAPNet's network architecture. The input Event Cloud is processed directly using a sliding window, sampling, and normalization, eliminating the need for any format conversion. Each of the $S$ samples is fed into the network as a sequence. The input passes through $S_{sum}$ loops between the Sampling and Grouping module and the Intra Group Aggregation module for spatial feature abstraction and extraction. It is then passed through a bidirectional LSTM to extract temporal features. The Inter Sample LSTM Module aggregates information within the sequence using the LSTM, and a final regressor performs the eye tracking.
  • Figure 5: Visualization of the predicted pupil location in various scenarios. (a) A case containing a large number of events. (b) A case with many noise events, such as those from eyelashes. (c) A scenario with only a few events. (d) Cases with different pitch angles of the eyes.
  • ...and 3 more figures
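
The dynamic window length described in the Figure 2 caption can be illustrated with a small sketch. The exact rule FAPNet uses is not given in this summary, so the code below assumes one simple policy: a window closes once it holds a target number of events, subject to a minimum and a maximum duration. Dense bursts (fast pupil movement) then produce short windows and a higher tracking frequency, while sparse periods produce longer windows. The function `adaptive_windows` and all of its parameters are hypothetical.

```python
def adaptive_windows(timestamps, target_events, min_len, max_len):
    """Split sorted event timestamps into variable-length windows.

    Assumed policy (not taken from the paper): a window is closed as soon
    as it contains `target_events` events and has lasted at least
    `min_len`, or when the next event would fall `max_len` or more after
    the window start. Returns (start, end) index pairs, end exclusive.
    """
    if target_events < 1 or not 0 <= min_len < max_len:
        raise ValueError("need target_events >= 1 and 0 <= min_len < max_len")
    if any(b < a for a, b in zip(timestamps, timestamps[1:])):
        raise ValueError("timestamps must be sorted in ascending order")

    windows = []
    i, n = 0, len(timestamps)
    while i < n:
        t0 = timestamps[i]
        j = i
        while j < n:
            elapsed = timestamps[j] - t0
            if elapsed >= max_len:
                break  # sparse period: cap the window duration
            if j - i >= target_events and elapsed >= min_len:
                break  # dense period: enough events collected
            j += 1
        windows.append((i, j))
        i = j
    return windows
```

For example, with ten events spaced one time unit apart followed by three events 100 units apart, `target_events=5`, `min_len=0`, and `max_len=150`, the dense burst is split into two short windows of five events each, while the sparse tail yields longer windows holding only one or two events.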