Exploring Temporal Dynamics in Event-based Eye Tracker

Hongwei Ren, Xiaopeng Lin, Hongxiang Huang, Yue Zhou, Bojun Cheng

TL;DR

TDTracker addresses the challenge of high-speed eye tracking with event cameras by modeling both short-term and long-term temporal dynamics. It has two branches. Implicit Temporal Dynamic (ITD) applies 3D convolutions to a Binary Map event representation. Explicit Temporal Dynamic (ETD) cascades FFT-based frequency processing, a GRU, and Mamba to capture rich temporal dependencies. These are followed by heatmap-based coordinate regression trained with a KL-divergence loss. The method achieves state-of-the-art performance on the synthetic SEET dataset and placed third in the CVPR 2025 Event-Based Eye Tracking Challenge, while remaining computationally efficient. The work highlights the importance of accurate temporal representation and heatmap regression for robust, low-latency gaze estimation in dynamic, low-power wearable scenarios.
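The Binary Map input mentioned above can be pictured with a short sketch. The exact encoding used by TDTracker is not given in this summary, so the layout below is an assumption. Events are assumed to arrive as `(t, x, y, p)` rows with polarity `p` in {0, 1}. The time window is split into `num_bins` equal bins, and a cell is set to 1 if at least one event of that polarity hit that pixel in that bin. The function name `events_to_binary_map` is hypothetical.

```python
import numpy as np

# Hypothetical Binary Map encoder: the (bins, polarity, H, W) layout is an
# assumption for illustration, not TDTracker's actual preprocessing code.
def events_to_binary_map(events, num_bins, height, width, t_start, t_end):
    """Return a uint8 tensor (num_bins, 2, height, width) of event occupancy."""
    out = np.zeros((num_bins, 2, height, width), dtype=np.uint8)
    ev = np.asarray(events, dtype=np.float64).reshape(-1, 4)
    t = ev[:, 0]
    x = ev[:, 1].astype(np.int64)
    y = ev[:, 2].astype(np.int64)
    p = ev[:, 3].astype(np.int64)
    # Keep events inside [t_start, t_end) with valid pixel coordinates and polarity.
    keep = ((t >= t_start) & (t < t_end)
            & (x >= 0) & (x < width) & (y >= 0) & (y < height)
            & ((p == 0) | (p == 1)))
    t, x, y, p = t[keep], x[keep], y[keep], p[keep]
    # Map each timestamp to an equal-width temporal bin.
    bins = np.floor((t - t_start) / (t_end - t_start) * num_bins).astype(np.int64)
    bins = np.clip(bins, 0, num_bins - 1)
    # Binary occupancy: repeated events in the same cell still yield 1.
    out[bins, p, y, x] = 1
    return out

# Usage: the first two events share a pixel and a bin, so they set one cell;
# the last event lies at t_end and is excluded from the window.
events = [(0.0, 1, 2, 1), (0.4, 1, 2, 1), (0.6, 3, 0, 0), (1.0, 0, 0, 1)]
bmap = events_to_binary_map(events, num_bins=2, height=4, width=4,
                            t_start=0.0, t_end=1.0)
```

Repeated events at one pixel collapse to a single 1. As a result, the map records where activity occurred in each bin rather than how much. A 3D convolution can then slide over the temporal bin axis as well as the spatial axes.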

Abstract

Eye tracking is a vital technology for human-computer interaction, especially in wearable devices such as AR, VR, and XR headsets. High-speed, high-precision eye tracking with frame-based image sensors is constrained by their limited temporal resolution. This limitation impairs the accurate capture of rapid ocular dynamics such as saccades and blinks. Event cameras, inspired by biological vision systems, can perceive eye movements with extremely low power consumption and ultra-high temporal resolution. This makes them a promising solution for high-speed, high-precision tracking with rich temporal dynamics. In this paper, we propose TDTracker, an effective eye-tracking framework that captures rapid eye movements by thoroughly modeling temporal dynamics from both implicit and explicit perspectives. TDTracker uses 3D convolutional neural networks to capture implicit short-term temporal dynamics. It then uses a cascaded structure consisting of a Frequency-aware Module, a GRU, and Mamba to extract explicit long-term temporal dynamics. Finally, a prediction heatmap is used for eye coordinate regression. Experimental results demonstrate that TDTracker achieves state-of-the-art (SOTA) performance on the synthetic SEET dataset, and it secured third place in the CVPR 2025 Event-Based Eye Tracking Challenge. Our code is available at https://github.com/rhwxmx/TDTracker.
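To make the explicit branch more concrete, here is a minimal NumPy sketch of the first two stages of the cascade described above: an FFT-based frequency filter over the time axis, followed by a GRU. The internals of the Frequency-aware Module are assumed here. The sketch uses a global filter with one learnable complex gain per frequency and channel, plus a residual connection. The Mamba stage is omitted, and all function names, parameter layouts and shapes are hypothetical.

```python
import numpy as np

# Hypothetical sketch of the ETD cascade's first two stages (frequency filter -> GRU).
# The filter design and parameter layout are assumptions, not TDTracker's code.

def frequency_filter(x, weight):
    """x: (T, C) real features; weight: (T//2 + 1, C) complex per-frequency gains."""
    T = x.shape[0]
    spec = np.fft.rfft(x, axis=0)                        # spectrum along time
    filtered = np.fft.irfft(spec * weight, n=T, axis=0)  # back to (T, C)
    return x + filtered                                  # residual connection

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru(x, params, hidden):
    """Minimal GRU over a (T, C) sequence; returns all hidden states, shape (T, hidden)."""
    Wz, Uz, bz, Wr, Ur, br, Wn, Un, bn = params
    h = np.zeros(hidden)
    outputs = []
    for x_t in x:
        z = sigmoid(Wz @ x_t + Uz @ h + bz)        # update gate
        r = sigmoid(Wr @ x_t + Ur @ h + br)        # reset gate
        n = np.tanh(Wn @ x_t + r * (Un @ h) + bn)  # candidate state
        h = (1.0 - z) * n + z * h                  # convex mix of old and new state
        outputs.append(h)
    return np.stack(outputs)

# Usage with random features and small random GRU weights.
rng = np.random.default_rng(0)
T, C, H = 8, 4, 6
x = rng.standard_normal((T, C))
w = np.ones((T // 2 + 1, C), dtype=complex)  # identity gain, so y equals 2 * x
y = frequency_filter(x, w)
# Shapes in order: Wz, Uz, bz, Wr, Ur, br, Wn, Un, bn.
params = [rng.standard_normal(s) * 0.1 for s in [(H, C), (H, H), (H,)] * 3]
h_seq = gru(y, params, H)
```

With an all-ones gain, the filter reduces to the identity plus the residual, which is a convenient sanity check. In a trained model the gains could instead emphasize or suppress particular temporal frequencies before the recurrent stages.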

Paper Structure

This paper contains 22 sections, 16 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: The architecture of TDTracker. TDTracker primarily comprises two components, Implicit Temporal Dynamic (ITD) and Explicit Temporal Dynamic (ETD). It uses three ITD components to ensure effective feature abstraction, and ETD cascades three distinct time-series models to comprehensively capture temporal information.
  • Figure 2: Visualization of a heatmap generated by TDTracker.
  • Figure 3: Visualization results of TDTracker. The green dot marks the Ground Truth label and the yellow dot marks the prediction generated by TDTracker.
  • Figure 4: Comparison of the Ground Truth trajectory and the trajectory predicted by TDTracker.
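The prediction heatmap and the KL-divergence training objective can be illustrated with a small sketch. It makes four assumptions. The target is a normalized 2D Gaussian centred on the ground-truth pupil position. The predicted logits become a distribution via a softmax over all pixels. The loss is KL(target || prediction). Coordinates are decoded with a soft-argmax. The grid size, `sigma`, the KL direction and the decoding rule are all assumptions rather than details from the paper, and the helper names are hypothetical.

```python
import numpy as np

# Hypothetical heatmap-regression sketch: Gaussian target, softmax prediction,
# KL(target || prediction) loss, soft-argmax decoding. Not TDTracker's code.

def gaussian_heatmap(cx, cy, height, width, sigma=1.5):
    """Normalized 2D Gaussian centred on the ground-truth point (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def softmax2d(logits):
    """Softmax over every pixel of a 2D logit map."""
    e = np.exp(logits - logits.max())  # subtract max for numerical stability
    return e / e.sum()

def kl_loss(target, logits, eps=1e-12):
    """KL(target || softmax(logits)), both treated as distributions over pixels."""
    pred = softmax2d(logits)
    return float(np.sum(target * (np.log(target + eps) - np.log(pred + eps))))

def soft_argmax(prob):
    """Expected (x, y) coordinate under a normalized heatmap."""
    ys, xs = np.mgrid[0:prob.shape[0], 0:prob.shape[1]]
    return float((prob * xs).sum()), float((prob * ys).sum())

# Usage: a perfect prediction gives near-zero loss; a noisy one does not.
target = gaussian_heatmap(cx=10.0, cy=6.0, height=16, width=20)
perfect_logits = np.log(target + 1e-12)
noisy_logits = perfect_logits + np.random.default_rng(0).standard_normal(target.shape)
loss_perfect = kl_loss(target, perfect_logits)
loss_noisy = kl_loss(target, noisy_logits)
x_hat, y_hat = soft_argmax(softmax2d(perfect_logits))
```

Soft-argmax keeps decoding differentiable and gives sub-pixel coordinates. A plain argmax over the heatmap is the simpler alternative when only the peak location is needed.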