Is Limited Participant Diversity Impeding EEG-based Machine Learning?

Philipp Bomatter, Henry Gouk

TL;DR

This paper investigates how limited participant diversity affects the generalization of EEG-based machine learning. It formalizes a two-level data generation process and evaluates scaling behaviour on three dataset and task pairs: EEG normality prediction on TUAB, dementia diagnosis on CAUEEG, and sleep staging on PhysioNet. It analyzes how model performance scales with the number of participants $n$ and the number of segments per participant $m$, showing that the generalization gap is dominated by distribution shifts across participants, consistent with a two-level scaling rate of $\tilde{\Theta}(1/n^{1/2} + 1/(mn)^{1/2})$. The study also evaluates EEG data augmentations and self-supervised pre-training (LaBraM) in data-limited regimes. Augmentations prove largely ineffective at compensating for limited participant diversity, whereas self-supervised pre-training yields robust improvements across tasks and datasets. In practice, the work suggests prioritizing broader participant diversity and using pre-training or distribution-shift-aware strategies to improve EEG-ML performance, because collecting more data per participant often provides limited gains and the benefits of additional data diminish at scale.
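To make the two-level rate quoted above concrete, the following minimal sketch evaluates its two terms for a few values of $n$ and $m$. The function name `two_level_rate` and the unit constants `c_participant` and `c_segment` are assumptions made for illustration; the $\tilde{\Theta}$ notation hides constants and logarithmic factors, so the numbers only show relative trends and are not predictions of model accuracy.

```python
import math


def two_level_rate(n: int, m: int,
                   c_participant: float = 1.0,
                   c_segment: float = 1.0) -> float:
    """Evaluate c_participant / sqrt(n) + c_segment / sqrt(m * n).

    The constants are hypothetical placeholders; log factors are ignored.
    """
    if n <= 0 or m <= 0:
        raise ValueError("n and m must be positive")
    return c_participant / math.sqrt(n) + c_segment / math.sqrt(m * n)


# Ten times more segments per participant versus ten times more participants.
base = two_level_rate(n=100, m=40)
more_segments = two_level_rate(n=100, m=400)
more_participants = two_level_rate(n=1000, m=40)

# For fixed n, the value can never drop below the participant term
# c_participant / sqrt(n), no matter how large m becomes.
floor_for_n_100 = 1.0 / math.sqrt(100)
```

Under these assumed constants, increasing $m$ only shrinks the second term, so the value stays above `floor_for_n_100`, whereas increasing $n$ shrinks both terms.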

Abstract

The application of machine learning (ML) to electroencephalography (EEG) has great potential to advance both neuroscientific research and clinical applications. However, the generalisability and robustness of EEG-based ML models often hinge on the amount and diversity of training data. It is common practice to split EEG recordings into small segments, thereby increasing the number of samples substantially compared to the number of individual recordings or participants. We conceptualise this as a multi-level data generation process and investigate the scaling behaviour of model performance with respect to the overall sample size and the participant diversity through large-scale empirical studies. We then use the same framework to investigate the effectiveness of different ML strategies designed to address limited data problems: data augmentations and self-supervised learning. Our findings show that model performance scaling can be severely constrained by participant distribution shifts and provide actionable guidance for data collection and ML research. The code for our experiments is publicly available online.
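The multi-level view described in the abstract can be sketched as a two-step subsampling routine: first draw $n$ participants, then draw $m$ segments from each. This is an illustrative sketch rather than the authors' released code; the function `subsample_two_level`, the dictionary layout, and the choice to skip participants with fewer than $m$ segments are all assumptions.

```python
import random


def subsample_two_level(segments_by_participant, n, m, seed=0):
    """Pick n participants, then m segments from each, without replacement.

    segments_by_participant maps a participant id to a list of segments
    (or segment indices). Participants with fewer than m segments are
    skipped, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    eligible = sorted(
        pid for pid, segs in segments_by_participant.items() if len(segs) >= m
    )
    if len(eligible) < n:
        raise ValueError("not enough participants with at least m segments")
    chosen = rng.sample(eligible, n)
    return {pid: rng.sample(segments_by_participant[pid], m) for pid in chosen}


# Toy data: 10 hypothetical participants with 50 segment indices each.
toy = {f"p{i:02d}": list(range(50)) for i in range(10)}
subset = subsample_two_level(toy, n=4, m=5, seed=1)

# The overall sample size is n * m, while participant diversity is n.
total_samples = sum(len(segs) for segs in subset.values())
```

Varying `n` and `m` independently in such a routine separates the effect of participant diversity from the effect of overall sample size.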

Paper Structure

This paper contains 29 sections, 1 equation, 12 figures, 2 tables.

Figures (12)

  • Figure 1: (A) Multi-level EEG data distribution. A dataset contains EEG recordings from multiple participants, which are divided into smaller segments to train machine learning models. (B) Grid visualising the subsampling of training datasets controlling for both the number of participants and the number of segments per participant. (C) The datasets, tasks, and models used in the experiments.
  • Figure 2: Scaling behaviour of model performance. Average accuracies of the different models for normality prediction on TUAB, dementia diagnosis on CAUEEG, and sleep staging on PhysioNet for increasing participant counts at a fixed number of segments per participant ($m=40$). Averages were computed across the seeds used to subsample the training datasets, and the shading illustrates the standard error of the mean. Across all datasets, performance increased strongly as the number of participants was increased to several hundred, after which improvements started to diminish.
  • Figure 3: Differential effects of participant count and overall sample size on model performance. Accuracies of the TCN model for normality prediction on TUAB, dementia diagnosis on CAUEEG, and sleep staging on PhysioNet averaged across seeds, along with the standard error of the mean. Performance on TUAB and CAUEEG was dominated by the participant count. Sleep staging performance on PhysioNet was more dependent on the overall sample size and competitive performance was achieved even with very limited participant counts.
  • Figure 4: Impact of data augmentations. Each boxplot shows pairwise accuracy differences between augmented and unaugmented training (positive means augmentation improved performance) across all combinations of participant counts ($n$) and segments per participant ($m$). Results for a single seed are shown for better visibility. AS = AmplitudeScaling, FS = FrequencyShift, PR = PhaseRandomisation. On PhysioNet, the LaBraM model benefitted from FS and PR, whereas PR decreased performance for TCN and mAtt. Other than that, augmentations did not lead to consistent improvements in performance.
  • Figure 5: Effectiveness of self-supervised pre-training. Comparison of LaBraM performance with and without pre-training on TUAB. The left and middle heatmaps show average accuracies (± standard error of the mean) for LaBraM trained from scratch (baseline) and pre-trained on a collection of 16 datasets before fine-tuning on TUAB, respectively. The right heatmap visualises the average of pairwise accuracy differences (± standard error of the mean), where pairs correspond to the accuracies of pre-trained and baseline models for the same seed. Pre-training consistently improved performance across all data regimes, except when the amount of fine-tuning data was most severely limited.
  • ...and 7 more figures
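
The captions of Figures 4 and 5 report pairwise accuracy differences between two training conditions for the same seed, summarised by their mean and standard error of the mean. A minimal sketch of that computation follows. The function name `paired_difference_summary` is hypothetical, and the accuracy values are invented toy numbers, not results from the paper.

```python
import math
import statistics


def paired_difference_summary(treated, baseline):
    """Return (mean, standard error of the mean) of per-seed differences.

    treated[i] and baseline[i] must come from the same seed, so the
    difference is taken pair by pair before averaging.
    """
    if len(treated) != len(baseline):
        raise ValueError("both conditions need one accuracy per seed")
    if len(treated) < 2:
        raise ValueError("at least two seeds are needed for a standard error")
    diffs = [t - b for t, b in zip(treated, baseline)]
    mean = statistics.fmean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean, sem


# Invented toy accuracies for four seeds (not taken from the paper).
pretrained = [0.80, 0.82, 0.79, 0.81]
from_scratch = [0.76, 0.79, 0.77, 0.78]
mean_diff, sem_diff = paired_difference_summary(pretrained, from_scratch)
```

Taking differences within each seed removes variation that both conditions share, such as which participants were subsampled, before the spread is estimated.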