Is Limited Participant Diversity Impeding EEG-based Machine Learning?
Philipp Bomatter, Henry Gouk
TL;DR
This paper investigates how limited participant diversity affects the generalization of EEG-based machine learning. It formalizes a two-level data generation process (participants, then segments within each participant) and evaluates scaling on three datasets and tasks: TUAB for EEG normality prediction, CAUEEG for dementia diagnosis, and PhysioNet for sleep staging. The analysis tracks how model performance scales with the number of participants $n$ and the number of segments per participant $m$, and shows that the generalization gap is dominated by distribution shifts across participants, consistent with a two-level scaling rate of $\tilde{\Theta}(1/n^{1/2} + 1/(mn)^{1/2})$. The study also evaluates EEG data augmentations and self-supervised pre-training (LaBraM) in data-limited regimes: augmentations are largely ineffective at compensating for limited participant diversity, whereas self-supervised pre-training yields robust improvements across tasks and datasets. In practice, the work suggests prioritizing broader participant diversity and using pre-training or distribution-shift-aware strategies, because simply collecting more data per participant often provides limited gains, and benefits saturate at scale.
Abstract
The application of machine learning (ML) to electroencephalography (EEG) has great potential to advance both neuroscientific research and clinical applications. However, the generalisability and robustness of EEG-based ML models often hinge on the amount and diversity of training data. It is common practice to split EEG recordings into small segments, thereby increasing the number of samples substantially compared to the number of individual recordings or participants. We conceptualise this as a multi-level data generation process and investigate the scaling behaviour of model performance with respect to the overall sample size and the participant diversity through large-scale empirical studies. We then use the same framework to investigate the effectiveness of different ML strategies designed to address limited data problems: data augmentations and self-supervised learning. Our findings show that model performance scaling can be severely constrained by participant distribution shifts and provide actionable guidance for data collection and ML research. The code for our experiments is publicly available online.
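To make the multi-level view concrete, the following is a minimal toy sketch, not the authors' experimental code. It assumes a purely Gaussian two-level process in which each of n participants has a random offset and each of their m segments adds independent noise, and it compares the empirical error of the grand mean with the analytical value sqrt(sigma_p^2/n + sigma_s^2/(mn)). The function names and the parameters `sigma_p`, `sigma_s`, and `trials` are hypothetical choices made for illustration.

```python
import numpy as np


def grand_mean_std(n, m, sigma_p=1.0, sigma_s=1.0, trials=2000, seed=0):
    """Empirical std of the grand mean over repeated toy two-level datasets.

    Assumed toy model: x_ij = u_i + e_ij, with participant offsets
    u_i ~ N(0, sigma_p^2) and segment noise e_ij ~ N(0, sigma_s^2).
    """
    rng = np.random.default_rng(seed)
    # One offset per participant, shared by all of that participant's segments.
    participant = rng.normal(0.0, sigma_p, size=(trials, n, 1))
    # Independent noise for every segment.
    segment = rng.normal(0.0, sigma_s, size=(trials, n, m))
    estimates = (participant + segment).mean(axis=(1, 2))
    return float(estimates.std())


def theoretical_std(n, m, sigma_p=1.0, sigma_s=1.0):
    """Analytical std of the grand mean under the same toy model."""
    return float(np.sqrt(sigma_p**2 / n + sigma_s**2 / (n * m)))


# Compare adding segments (m) with adding participants (n).
for n, m in [(10, 16), (10, 64), (40, 16)]:
    print(f"n={n:3d} m={m:3d} "
          f"empirical={grand_mean_std(n, m):.3f} "
          f"theory={theoretical_std(n, m):.3f}")
```

Under these assumptions, the participant-level term sigma_p^2/n does not depend on m, so quadrupling the segments per participant barely reduces the error, while quadrupling the participants roughly halves it. This is only an analogy for the scaling behaviour described in the abstract: real EEG models and tasks are far more complex than estimating a mean.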
