SPEED: Scalable Preprocessing of EEG Data for Self-Supervised Learning

Anders Gjølbye; Lina Skerath; William Lehn-Schiøler; Nicolas Langer; Lars Kai Hansen

TL;DR

This study addresses the challenge of leveraging large-scale, unlabeled EEG data for self-supervised learning by introducing SPEED, a scalable Python-based preprocessing pipeline tailored for SSL. SPEED integrates robust artifact removal, dynamic channel handling, ICA with ICLabel, interpolation, and standardized resampling to produce high-quality, reproducible inputs for SSL models like BENDR. Across TUH pretraining and downstream benchmarks (MMIDB, BC Bhutan, BCI@NER), SPEED improves contrastive learning stability and downstream classification accuracy, with notable gains over Baseline preprocessing; however, the impact of ICA-based artifact removal varies by task. The work demonstrates that judicious, scalable preprocessing can yield more discriminative latent representations and more reliable downstream performance, offering practical tools and logs for reproducibility and future EEG foundation-model development.

Abstract

Electroencephalography (EEG) research typically focuses on tasks with narrowly defined objectives, but recent studies are expanding into the use of unlabeled data within larger models, aiming for a broader range of applications. This shift addresses a critical challenge in EEG research. For example, Kostas et al. (2021) show that self-supervised learning (SSL) outperforms traditional supervised methods. Given the high noise levels in EEG data, we argue that further improvements are possible with additional preprocessing. Current preprocessing methods often fail to efficiently manage the large data volumes required for SSL because they are not optimized for scale, rely on subjective manual corrections and validation, or follow inflexible protocols that limit SSL. We propose a Python-based EEG preprocessing pipeline optimized for self-supervised learning and designed to process large-scale data efficiently. This optimization not only stabilizes self-supervised training but also enhances performance on downstream tasks compared to training with raw data.
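To make two of the pipeline stages named in the section outline more concrete (bad channel detection and average referencing), the following is a minimal sketch in plain Python. It is not the SPEED implementation: the function names, the median-absolute-deviation rule, and the threshold of 5.0 are illustrative assumptions, and the real pipeline's filtering, Zapline, ICA, interpolation, and resampling steps are not shown.

```python
from statistics import mean, median, pstdev


def flag_bad_channels(data, threshold=5.0):
    """Return channels whose amplitude spread is a robust outlier.

    `data` maps channel name -> list of samples. The MAD-based rule and
    the default threshold are assumptions made for illustration only.
    """
    spreads = {ch: pstdev(sig) for ch, sig in data.items()}
    med = median(spreads.values())
    mad = median(abs(s - med) for s in spreads.values())
    scale = mad if mad > 0 else 1e-12  # avoid division by zero
    return [ch for ch, s in spreads.items() if abs(s - med) / scale > threshold]


def average_reference(data, exclude=()):
    """Subtract the per-sample mean of the non-excluded channels."""
    good = [ch for ch in data if ch not in exclude]
    n_samples = len(next(iter(data.values())))
    ref = [mean(data[ch][t] for ch in good) for t in range(n_samples)]
    return {ch: [x - r for x, r in zip(sig, ref)] for ch, sig in data.items()}


# Toy recording: three similar channels and one with implausibly large amplitude.
recording = {
    "Fz": [50.0, -50.0, 50.0, -50.0],
    "Cz": [1.0, -1.0, 1.0, -1.0],
    "Pz": [1.1, -1.1, 1.1, -1.1],
    "Oz": [0.9, -0.9, 0.9, -0.9],
}
bad = flag_bad_channels(recording)
rereferenced = average_reference(recording, exclude=bad)
```

Excluding flagged channels from the reference keeps a single noisy electrode from contaminating every other channel, which is one reason bad channel detection precedes average referencing in the outline.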

Paper Structure (27 sections, 5 figures, 1 table)


Introduction
Background
Data
TUH EEG Corpus (TUEG)
Motor Movement/Imagery Dataset (MMIDB)
BrainCapture Bhutan Dataset v. 4.1 (BC Bhutan)
BCI Challenge @ NER 2015 (BCI@NER)
Methods
Initial Setup of preprocessing
Quality assessment
Iterative Zapline
Bad Channel Detection
Filtering & Average Referencing
Independent Component Analysis
Interpolation & Resample
...and 12 more sections

Figures (5)

Figure 1: An overview of the Scalable Preprocessing for EEG Data (SPEED) pipeline and the SPEED w/ ICA pipeline. These pipelines are designed to handle massive amounts of EEG data and are intended for use with a self-supervised learning (SSL) model. Red highlights the removal of faulty or unnecessary data, while dotted lines mark pipeline components that are excluded when preprocessing downstream datasets.
Figure 2: Distribution of interpolated channels during preprocessing with SPEED and SPEED w/ ICA. Most channels are interpolated because they were detected as bad, while in a few instances Fz and Pz were missing from the original recordings.
Figure 3: Distribution of independent component (IC) classifications by ICLabel during preprocessing with the SPEED pipeline on the TUH EEG Corpus. As expected, most ICs are classified as brain or other.
Figure 4: UMAP embeddings of the Bhutan dataset. The subplots show the distribution of the 5 artifact classes as scatterplots. The density of the whole dataset is represented by the contour plot. The representations based on SPEED show better alignment with the ground truth labels.
Figure 5: Validation contrastive accuracy during pretraining for three differently preprocessed versions of the dataset: SPEED, SPEED w/ ICA, and Baseline. The models trained on SPEED and SPEED w/ ICA data train more stably and achieve higher scores.
