EmMixformer: Mix transformer for eye movement recognition

Huafeng Qin, Hongyu Zhu, Xin Jin, Qun Song, Mounim A. El-Yacoubi, Xinbo Gao

TL;DR

This work tackles eye movement biometric recognition by exploiting both temporal and spectral information. It introduces EmMixformer, a mixed transformer architecture that fuses a Transformer, an attention-enhanced LSTM, and a Fourier-domain transformer within a Siamese CNN-based pipeline to capture long-range temporal dependencies and global frequency features. The approach is validated on a self-collected low-frequency EM dataset and two public datasets (GazeBase and JuDo1000), achieving state-of-the-art verification performance and demonstrating robustness to short and long time intervals. The results suggest significant potential for secure, real-time eye movement biometrics and motivate future multimodal fusion and larger-scale data collection.

Abstract

Eye movement (EM) is a new, highly secure behavioral biometric modality that has received increasing attention in recent years. Although deep neural networks, such as convolutional neural networks (CNNs), have recently achieved promising performance, current solutions fail to capture local and global temporal dependencies within eye movement data. To overcome this problem, we propose a mixed transformer, termed EmMixformer, that extracts time- and frequency-domain information for eye movement recognition. To this end, we propose a mixed block consisting of three modules: a transformer, an attention long short-term memory (attention LSTM), and a Fourier transformer. First, we make the first attempt to leverage a transformer to learn long temporal dependencies within eye movement data. Second, we incorporate the attention mechanism into an LSTM, yielding the attention LSTM, to learn short temporal dependencies. Third, we perform self-attention in the frequency domain to learn global features. As the three modules provide complementary feature representations in terms of local and global dependencies, the proposed EmMixformer improves recognition accuracy. Experimental results on our own eye movement dataset and two public eye movement datasets show that EmMixformer outperforms the state of the art, achieving the lowest verification error.
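
As a rough illustration of the third module's idea, self-attention applied in the frequency domain, the sketch below transforms a short gaze-like signal with a real FFT and runs scaled dot-product attention across the resulting frequency bins. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the abstract does not say how the spectrum is turned into tokens, so stacking real and imaginary parts per bin, the random projection matrices `wq`, `wk`, `wv`, and the function names are all hypothetical choices made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, wq, wk, wv):
    # Scaled dot-product self-attention over the rows of `tokens`.
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def frequency_self_attention(signal, wq, wk, wv):
    # signal: real array of shape (time, channels).
    spectrum = np.fft.rfft(signal, axis=0)  # (time // 2 + 1, channels), complex
    # Assumption: each frequency bin becomes one token whose features are
    # the real and imaginary parts of every channel at that bin.
    tokens = np.concatenate([spectrum.real, spectrum.imag], axis=-1)
    return self_attention(tokens, wq, wk, wv)

rng = np.random.default_rng(0)
T, C, D = 64, 2, 8                      # time steps, channels, attention width
signal = rng.standard_normal((T, C))    # stand-in for a two-channel gaze signal
wq, wk, wv = (0.1 * rng.standard_normal((2 * C, D)) for _ in range(3))

out = frequency_self_attention(signal, wq, wk, wv)
print(out.shape)  # (33, 8): one output row per frequency bin
```

Because every frequency bin summarizes the whole time window, attention across bins mixes information that is global in time, which is one way to read the abstract's claim that the frequency-domain module learns global features.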

Paper Structure

This paper contains 23 sections, 35 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: Eye Musculature
  • Figure 2: The framework of the proposed EmMixformer model
  • Figure 3: Attention LSTM
  • Figure 4: Fourier transformer
  • Figure 5: Data collecting
  • ...and 3 more figures