Dual-Path Enhancements in Event-Based Eye Tracking: Augmented Robustness and Adaptive Temporal Modeling

Hoang M. Truong, Vinh-Thuan Ly, Huy G. Tran, Thuan-Phat Nguyen, Tram T. Doan

TL;DR

This paper tackles robust, real-time gaze estimation from event-based eye-tracking data under real-world noise and rapid eye movements. It introduces two complementary strategies: augmented robustness for a lightweight spatiotemporal baseline and KnightPupil, a hybrid architecture that fuses EfficientNet-B3 spatial encoding with Bi-GRU temporal modeling and a dynamic Linear Time-Varying State-Space Model for adaptive temporal transitions. On the 3ET+ benchmark, augmentation improves robustness while KnightPupil delivers strong edge-deployable performance, achieving competitive Euclidean error and p10 metrics. The proposed dual-path framework balances deployable efficiency with adaptive temporal modeling, offering a solid foundation for future neuromorphic-vision developments in AR/VR and neuro-oculomotor analysis.
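To make "adaptive temporal transitions" concrete, here is a minimal NumPy sketch of a generic linear time-varying state-space recurrence, h_t = A_t h_{t-1} + B_t x_t with readout y_t = C_t h_t. It is the textbook form only. The function name `ltv_ssm_scan`, the zero initial state, and the choice to pass the per-step matrices in as arrays are assumptions for illustration; the summary does not say how KnightPupil parameterizes or predicts these matrices.

```python
import numpy as np

def ltv_ssm_scan(x, A, B, C):
    """Generic LTV state-space recurrence (illustrative, not the paper's module).

    x: (T, d_in) inputs
    A: (T, d_state, d_state) per-step transition matrices
    B: (T, d_state, d_in) per-step input matrices
    C: (T, d_out, d_state) per-step readout matrices
    Returns y with shape (T, d_out).
    """
    num_steps = x.shape[0]
    h = np.zeros(A.shape[1])              # assumed zero initial state
    y = np.empty((num_steps, C.shape[1]))
    for t in range(num_steps):
        h = A[t] @ h + B[t] @ x[t]        # transition changes at every step
        y[t] = C[t] @ h
    return y
```

Because A_t, B_t, and C_t are indexed by time, the model can in principle damp or amplify its memory step by step, which is the property the TL;DR refers to as adaptive.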

Abstract

Event-based eye tracking has become a pivotal technology for augmented reality and human-computer interaction. Yet existing methods struggle with real-world challenges such as abrupt eye movements and environmental noise. Building on the efficiency of the Lightweight Spatiotemporal Network, a causal architecture optimized for edge devices, we introduce two key advancements. First, a robust data augmentation pipeline incorporating temporal shift, spatial flip, and event deletion improves model resilience, reducing Euclidean distance error by 12% (1.61 vs. 1.70 baseline) on challenging samples. Second, we propose KnightPupil, a hybrid architecture combining an EfficientNet-B3 backbone for spatial feature extraction, a bidirectional GRU for contextual temporal modeling, and a Linear Time-Varying State-Space Module that adapts dynamically to sparse inputs and noise. Evaluated on the 3ET+ benchmark, our framework achieved a Euclidean distance of 1.61 on the private test set of the Event-based Eye Tracking Challenge at CVPR 2025, demonstrating its effectiveness for practical deployment in AR/VR systems while providing a foundation for future innovations in neuromorphic vision.
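The three augmentations named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes events are stored as a float NumPy array with columns (x, y, t, polarity), and the function names and the parameters `width`, `max_shift`, and `drop_prob` are hypothetical.

```python
import numpy as np

def spatial_flip(events, width):
    """Mirror event x-coordinates on a sensor that is `width` pixels wide."""
    flipped = events.copy()
    flipped[:, 0] = (width - 1) - flipped[:, 0]
    return flipped

def temporal_shift(events, max_shift, rng):
    """Add one random offset in [-max_shift, max_shift] to every timestamp."""
    shifted = events.copy()
    shifted[:, 2] += rng.uniform(-max_shift, max_shift)
    return shifted

def event_deletion(events, drop_prob, rng):
    """Drop each event independently with probability `drop_prob`."""
    keep = rng.random(len(events)) >= drop_prob
    return events[keep]

# Example: chain the three steps on two synthetic events.
rng = np.random.default_rng(0)
events = np.array([[0.0, 0.0, 0.0, 1.0],
                   [3.0, 1.0, 10.0, 0.0]])
augmented = event_deletion(
    temporal_shift(spatial_flip(events, width=4), max_shift=5.0, rng=rng),
    drop_prob=0.1, rng=rng)
```

In a real training loop, a spatial flip would also have to mirror the pupil-coordinate labels, and a temporal shift would have to keep labels aligned with the shifted events. Both are omitted here for brevity.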

Paper Structure

This paper contains 27 sections, 12 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A compact spatiotemporal model integrating data augmentation with spatial and temporal processing blocks. Convolutional layers extract spatial and temporal features efficiently.
  • Figure 3: Overview of data augmentation techniques: Spatial Flip mirrors event coordinates, Temporal Shift modifies event timing, and Event Deletion simulates sensor noise.
  • Figure 4: Voxel grid representation of event data, where raw events are segmented into windows and accumulated into spatial bins over time.
  • Figure 5: Overview of the KnightPupil architecture. The model consists of three key components: (1) an EfficientNet-B3 backbone for spatial feature extraction, (2) a Bidirectional GRU (Bi-GRU) for temporal modeling, and (3) a Linear Time-Varying State-Space Model (LTV-SSM) for adaptive state transitions. This design enables robust event-based gaze estimation by efficiently capturing spatial and temporal dependencies.
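The voxel grid representation described in the Figure 4 caption can be sketched as below. This is an illustrative version under stated assumptions: events arrive as a float array with columns (x, y, t, polarity), time is split into `num_bins` equal-width bins over the window, and polarity is accumulated as +1 or -1. The function name and the signed-polarity choice are assumptions; the caption only states that events are segmented into windows and accumulated into spatial bins over time.

```python
import numpy as np

def events_to_voxel_grid(events, num_bins, height, width):
    """Accumulate (x, y, t, p) events into a (num_bins, height, width) grid."""
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return grid
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    # Assumed convention: positive polarity adds 1, anything else subtracts 1.
    polarity = np.where(events[:, 3] > 0, 1.0, -1.0).astype(np.float32)
    span = t.max() - t.min()
    if span > 0:
        bins = ((t - t.min()) / span * num_bins).astype(int)
    else:
        bins = np.zeros(len(t), dtype=int)
    bins = np.clip(bins, 0, num_bins - 1)   # the latest event lands in the last bin
    # np.add.at accumulates correctly when several events hit the same cell.
    np.add.at(grid, (bins, y, x), polarity)
    return grid
```

The resulting tensor has a fixed shape regardless of how many events fall in the window, which is what lets a convolutional backbone consume sparse event streams as dense frames.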