Effect of Kernel Size on CNN-Vision-Transformer-Based Gaze Prediction Using Electroencephalography Data
Chuhui Qiu, Bugao Liang, Matthew L Key
TL;DR
This work investigates how kernel size in CNN–vision transformer hybrids affects EEG-based gaze prediction on the EEGEyeNet dataset. Using a two-stage convolutional front-end, whose second stage is a depth-wise spatial convolution spanning all EEG channels, followed by a ViT backbone, the method achieves better accuracy than the current SOTA, EEGViT, while reducing training time. The results indicate that a spatial kernel large enough to cover every EEG channel helps the model learn spatial relationships across the whole electrode array. Real-world deployment remains challenging because accuracy and speed still lag behind video-based eye tracking. Overall, the results underscore the potential of CNN–transformer hybrids with broad channel receptive fields for EEG-based gaze estimation and point to future work on richer datasets and real-world applicability.
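To make "full-channel" concrete, the short sketch below applies the standard convolution output-size formula along the EEG channel axis. It is only an illustration: the channel count of 129 and the comparison kernel height of 8 are assumptions that do not appear in the text above, and `conv_output_size` is a hypothetical helper, not part of the authors' code.

```python
def conv_output_size(n_in: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length of a convolution along one axis (standard formula)."""
    if n_in + 2 * padding < kernel:
        raise ValueError("kernel is larger than the padded input")
    return (n_in + 2 * padding - kernel) // stride + 1


# Assumed values for illustration only (not stated in the text above).
N_CHANNELS = 129   # assumed number of EEG channels in one input sample
SMALL_KERNEL = 8   # hypothetical smaller spatial kernel, for comparison

# Rows remaining along the channel axis after the spatial convolution.
rows_small = conv_output_size(N_CHANNELS, SMALL_KERNEL)
rows_full = conv_output_size(N_CHANNELS, N_CHANNELS)

print(f"kernel height {SMALL_KERNEL}: {rows_small} rows remain")
print(f"kernel height {N_CHANNELS}: {rows_full} row remains")
```

Under these assumptions, a kernel as tall as the channel axis collapses it to a single row, so each output value is computed from every electrode at once, whereas a shorter kernel only mixes neighboring rows of the input.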
Abstract
In this paper, we present an algorithm for gaze prediction from electroencephalography (EEG) data. EEG-based gaze prediction is an emerging research topic that can serve as an alternative to traditional video-based eye tracking. Compared to the existing state-of-the-art (SOTA) method, we improve the root mean squared error (RMSE) of EEG-based gaze prediction to 53.06 millimeters while reducing the training time to less than 33% of that method's training time. Our source code can be found at https://github.com/AmCh-Q/CSCI6907Project
