SPEED: Scalable Preprocessing of EEG Data for Self-Supervised Learning
Anders Gjølbye, Lina Skerath, William Lehn-Schiøler, Nicolas Langer, Lars Kai Hansen
TL;DR
This study introduces SPEED, a scalable Python-based preprocessing pipeline that prepares large-scale, unlabeled EEG data for self-supervised learning (SSL). SPEED integrates robust artifact removal, dynamic channel handling, ICA with ICLabel, interpolation, and standardized resampling to produce high-quality, reproducible inputs for SSL models such as BENDR. With pretraining on TUH and evaluation on downstream benchmarks (MMIDB, BC Bhutan, BCI@NER), SPEED improves contrastive learning stability and downstream classification accuracy relative to baseline preprocessing, although the impact of ICA-based artifact removal varies by task. The work shows that careful, scalable preprocessing can yield more discriminative latent representations and more reliable downstream performance, and it provides tools and logs to support reproducibility and future EEG foundation-model development.
Abstract
Electroencephalography (EEG) research typically focuses on tasks with narrowly defined objectives, but recent studies are expanding into the use of unlabeled data within larger models, aiming for a broader range of applications. This shift addresses a critical challenge in EEG research. For example, Kostas et al. (2021) show that self-supervised learning (SSL) outperforms traditional supervised methods. Given the high noise levels in EEG data, we argue that further improvements are possible with additional preprocessing. Current preprocessing methods often fail to manage the large data volumes required for SSL efficiently, because they are not optimized for scale, rely on subjective manual corrections and validation processes, or follow inflexible protocols that limit SSL. We propose a Python-based EEG preprocessing pipeline optimized for self-supervised learning and designed to process large-scale data efficiently. This optimization not only stabilizes self-supervised training but also improves performance on downstream tasks compared to training with raw data.
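
The summary above names automated, non-manual channel handling as one ingredient of scalable preprocessing but does not say how bad channels are detected. As an illustration only, the following minimal sketch flags channels whose log-variance is a robust outlier, using the median and the median absolute deviation (MAD). It is not SPEED's actual implementation: the function name `flag_bad_channels`, the threshold `z_thresh`, the list-of-lists data layout, and the synthetic sine-wave recording are all assumptions made for this example.

```python
import math
import statistics


def flag_bad_channels(data, z_thresh=5.0):
    """Return indices of channels whose log-variance is a robust outlier.

    data:     list of channels, each a list of samples (assumed layout).
    z_thresh: assumed cutoff on the MAD-based z-score.
    """
    # Log-variance puts flat (near-zero) and very noisy channels on one scale.
    log_var = [math.log(statistics.pvariance(ch) + 1e-12) for ch in data]
    med = statistics.median(log_var)
    mad = statistics.median([abs(v - med) for v in log_var])
    # 1.4826 * MAD approximates the standard deviation for Gaussian data.
    scale = 1.4826 * mad + 1e-12
    return [i for i, v in enumerate(log_var) if abs(v - med) / scale > z_thresh]


# Synthetic 8-channel recording (hypothetical): 2 s of a 10 Hz sine at 256 Hz.
fs, n_samples = 256, 512
base = [math.sin(2 * math.pi * 10 * t / fs) for t in range(n_samples)]
amplitudes = [1.00, 1.05, 1.10, 1.15, 1.20, 1.25]  # six plausible channels
channels = [[a * s for s in base] for a in amplitudes]
channels.append([0.0] * n_samples)          # channel 6: flat (disconnected)
channels.append([50.0 * s for s in base])   # channel 7: extreme amplitude

bad = flag_bad_channels(channels)
print(bad)
```

Because the rule depends only on statistics computed from the recording itself, it can run unattended over a large corpus. In a full pipeline the flagged channels would typically be interpolated or dropped before later steps such as ICA.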
