MambaPupil: Bidirectional Selective Recurrent model for Event-based Eye tracking

Zhong Wang; Zengyu Wan; Han Han; Bohao Liao; Yuliang Wu; Wei Zhai; Yang Cao; Zheng-jun Zha

TL;DR

MambaPupil addresses the challenge of stable pupil localization in event-based eye tracking under diverse motion patterns by modeling temporal context bidirectionally and selectively weighting informative time steps. The method combines a CNN-based spatial encoder, a Dual Recurrent Module (a Bi-GRU plus a Linear Time-Varying State Space Module), and the Bina-rep input representation with Event-Cutout augmentation. Empirical results on the ThreeET-plus benchmark show state-of-the-art performance, with notable gains in $p_5$, $p_{10}$, and $p_{15}$ accuracy and reduced $p_{error}$, while maintaining efficiency. The approach demonstrates robust tracking across challenging conditions (blink, fast motion, rest) and offers a practical, low-cost solution for high-temporal-resolution eye tracking in HCI and VR/AR contexts.
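The metrics named above are not defined in this excerpt. A minimal sketch, assuming that $p_k$ accuracy is the fraction of frames whose predicted pupil center lies within $k$ pixels (Euclidean distance) of the label, and that $p_{error}$ is the mean Euclidean pixel distance. The function names and the inclusive threshold are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def pixel_distances(pred, target):
    """Per-frame Euclidean distance between predicted and labelled (x, y) centres."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.linalg.norm(pred - target, axis=-1)


def p_accuracy(pred, target, tolerance_px):
    """Assumed definition: share of frames with distance <= tolerance_px."""
    return float(np.mean(pixel_distances(pred, target) <= tolerance_px))


def mean_pixel_error(pred, target):
    """Assumed definition of p_error: mean Euclidean distance in pixels."""
    return float(np.mean(pixel_distances(pred, target)))


# Toy data: four frames with distances of 3, 8, 12 and 20 pixels.
pred = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0], [40.0, 40.0]])
target = np.array([[10.0, 13.0], [20.0, 28.0], [30.0, 42.0], [40.0, 60.0]])

for k in (5, 10, 15):
    print(f"p{k} accuracy:", p_accuracy(pred, target, k))
print("p_error:", mean_pixel_error(pred, target))
```

Under these assumptions, a tighter tolerance such as $p_5$ rewards precise localization, while $p_{error}$ also penalizes large misses.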

Abstract

Event-based eye tracking has shown great promise thanks to the high temporal resolution and low redundancy provided by the event camera. However, the diversity and abruptness of eye movement patterns, including blinking, fixating, saccades, and smooth pursuit, pose significant challenges for eye localization. To achieve a stable event-based eye-tracking system, this paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism to fully utilize contextual temporal information in response to the variability of eye movements. Specifically, the MambaPupil network is proposed, which consists of a multi-layer convolutional encoder to extract features from the event representations, a bidirectional Gated Recurrent Unit (GRU), and a Linear Time-Varying State Space Module (LTV-SSM), to selectively capture contextual correlation from the forward and backward temporal relationships. Furthermore, Bina-rep is utilized as a compact event representation, and a tailor-made data augmentation, called Event-Cutout, is proposed to enhance the model's robustness by applying spatial random masking to the event image. The evaluation on the ThreeET-plus benchmark shows the superior performance of MambaPupil, which secured 1st place in the CVPR 2024 AIS Event-based Eye Tracking challenge.
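To make the idea of time-varying state selection concrete, the following is a minimal NumPy sketch of a selective state-space scan in which the step size and the input/output projections are computed from the current input. It follows the commonly used Mamba-style discretization, which this excerpt does not confirm as the paper's exact formulation. The names `selective_scan`, `W_delta`, `W_B`, and `W_C` are hypothetical, and the Bi-GRU and convolutional encoder are omitted.

```python
import numpy as np


def softplus(x):
    """Numerically stable log(1 + exp(x)); keeps step sizes positive."""
    return np.logaddexp(0.0, x)


def selective_scan(x, A, W_delta, W_B, W_C):
    """Sketch of an input-dependent (time-varying) diagonal SSM.

    x: (T, D) feature sequence; A: (D, N) with negative entries;
    W_delta: (D, D); W_B, W_C: (D, N). Returns y with shape (T, D).
    All parameter names are illustrative assumptions.
    """
    x = np.asarray(x, dtype=float)
    T, D = x.shape
    h = np.zeros((D, A.shape[1]))            # hidden state per channel
    y = np.empty((T, D))
    for t in range(T):
        delta = softplus(x[t] @ W_delta)     # (D,) content-dependent step size
        B_t = x[t] @ W_B                     # (N,) content-dependent input map
        C_t = x[t] @ W_C                     # (N,) content-dependent output map
        A_bar = np.exp(delta[:, None] * A)   # (D, N) decay factor in (0, 1)
        h = A_bar * h + (delta[:, None] * B_t[None, :]) * x[t][:, None]
        y[t] = h @ C_t                       # (D,) read out the state
    return y


rng = np.random.default_rng(0)
T, D, N = 6, 4, 3
x = rng.standard_normal((T, D))
A = -np.exp(rng.standard_normal((D, N)))     # negative values give a stable decay
W_delta = 0.1 * rng.standard_normal((D, D))
W_B = 0.1 * rng.standard_normal((D, N))
W_C = 0.1 * rng.standard_normal((D, N))
y = selective_scan(x, A, W_delta, W_B, W_C)
print(y.shape)
```

Because `delta`, `B_t`, and `C_t` depend on `x[t]`, the recurrence can retain or discard state depending on content, for instance down-weighting frames recorded during a blink. The scan itself is causal; in the paper, context from both temporal directions is attributed to the bidirectional GRU.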

Paper Structure (15 sections, 7 equations, 6 figures, 5 tables)

Introduction
Related Work
Eye Tracking
State Space Model (SSM)
Methodology
Event Processing
Architecture of Proposed Network
Loss Function for Training
Experiments
Datasets and Preprocessing
Implementation Details
Comparison with Other Methods
Ablation Study
Qualitative Comparison
Conclusion

Figures (6)

Figure 1: The diversity of eye movements, including blinking, saccades, resting, etc., poses significant challenges for accurate and stable eye tracking. To address these challenges, leveraging contextual temporal information becomes crucial in assisting with eye localization. The historical and future data of eye movements can help to infer the current position and potential trajectories of the eye.
Figure 2: Top: Event streams are transformed into Bina-rep representations and augmented only during the training phase. Bottom: The MambaPupil framework is composed of the Spatial Feature Extractor, the Dual Recurrent Module, and a Fully Connected layer as the final classifier. The Spatial Feature Extractor consists of stacked convolutional blocks. Right: The Dual Recurrent Module consists of two temporal modeling units, Bi-GRU and LTV-SSM. The Bi-GRU extracts the complete contextual information through bidirectional temporal information flow, and the LTV-SSM achieves selective temporal state encoding through content-dependent parameter tuning.
Figure 3: The Bina-rep is generated by aggregating binary-masked event frames. The procedure is the same for each event polarity, and the results are concatenated at the end.
Figure 4: Various data augmentation methods are adopted to improve robustness, including spatial flip, spatial shift, Event-Cutout, etc. Event-Cutout is a spatial augmentation technique suited to event-based eye tracking, which sets the pixels within a random rectangular box to zero to simulate potential external interference, e.g., eye blinking.
Figure 5: Visual comparison of the state-of-the-art model and our MambaPupil in four challenging scenarios: onset, blink, fast movement, and eye rest.
...and 1 more figure
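The event processing described in the captions of Figures 3 and 4 can be sketched as follows. This is a rough illustration under stated assumptions: the binary event frames are taken as given, each pixel's N binary values are read as an N-bit number with the earliest frame as the most significant bit and normalized to [0, 1] (the bit order and the normalization are assumptions), and the cutout box size is capped by a hypothetical `max_frac` parameter. The function names are illustrative and not from the paper.

```python
import numpy as np


def bina_rep(binary_frames):
    """Aggregate binary event frames into one frame per polarity.

    binary_frames: (2, N, H, W) array of 0/1 values, one stack of N
    binary frames per polarity. Returns a (2, H, W) float array in [0, 1].
    """
    binary_frames = np.asarray(binary_frames)
    n_bits = binary_frames.shape[1]
    # Assumption: the earliest frame is the most significant bit.
    weights = 2 ** np.arange(n_bits - 1, -1, -1)
    value = np.einsum("n,pnhw->phw", weights, binary_frames)
    return value / float(2 ** n_bits - 1)


def event_cutout(frame, rng, max_frac=0.5):
    """Zero a random rectangle across all channels of a (C, H, W) frame.

    max_frac (hypothetical) caps the box size relative to the frame.
    Returns a modified copy; the input is left unchanged.
    """
    _, h, w = frame.shape
    box_h = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    box_w = int(rng.integers(1, max(2, int(w * max_frac) + 1)))
    top = int(rng.integers(0, h - box_h + 1))
    left = int(rng.integers(0, w - box_w + 1))
    out = frame.copy()
    out[:, top:top + box_h, left:left + box_w] = 0
    return out


# Toy input: 3 binary frames per polarity on a 2x2 sensor.
binary = np.zeros((2, 3, 2, 2), dtype=int)
binary[0, :, 0, 0] = [1, 0, 1]   # bits 101 -> 5, normalised to 5/7
binary[1, :, 1, 1] = [1, 1, 1]   # bits 111 -> 7, normalised to 1.0
rep = bina_rep(binary)

rng = np.random.default_rng(0)
frame = np.ones((2, 8, 8))
cut = event_cutout(frame, rng)   # training-time augmentation only
print(rep[0, 0, 0], rep[1, 1, 1], int((cut == 0).sum()))
```

Masking the same box in both polarity channels imitates an occlusion such as an eyelid covering part of the eye region, which is the kind of interference the caption of Figure 4 describes.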
