Quantity versus Diversity: Influence of Data on Detecting EEG Pathology with Advanced ML Models

Martyna Poziomska, Marian Dovgialo, Przemysław Olbratowski, Paweł Niedbalski, Paweł Ogniewski, Joanna Zych, Jacek Rogala, Jarosław Żygierewicz

TL;DR

Small, consistent datasets enable a wide range of models to achieve high accuracy; however, variations in pathological conditions, recording protocols, and labeling standards lead to significant performance degradation.

Abstract

This study investigates how the quantity and diversity of data affect the performance of various machine-learning models for detecting general EEG pathology. We used an EEG dataset of 2,993 recordings from Temple University Hospital and a dataset of 55,787 recordings from Elmiko Biosignals sp. z o.o. The latter contains data from 39 hospitals and a diverse patient set with varied conditions. With it, we introduce the Elmiko dataset - the largest publicly available EEG corpus. Our findings show that small and consistent datasets enable a wide range of models to achieve high accuracy; however, variations in pathological conditions, recording protocols, and labeling standards lead to significant performance degradation. Nonetheless, increasing the number of available recordings improves predictive accuracy and may even compensate for data diversity, particularly in neural networks based on attention mechanisms or transformer architectures. A meta-model that combined these networks with a gradient-boosting approach using handcrafted features demonstrated superior performance across varied datasets.

Paper Structure

This paper contains 26 sections, 2 equations, 17 figures, 5 tables.

Figures (17)

  • Figure 1: Age, sex, and diagnosis distributions in the TUH and ELM$_{19}$ databases.
  • Figure 2: Distribution of recording lengths in the TUH and ELM$_{19}$ databases after data preprocessing. The length is expressed as the number of 6-second frames.
  • Figure 3: Architecture of the EEGNet frame encoder, which processes a 6-second frame of the EEG signal and outputs an encoding as a vector of 288 features. This encoder is reimplemented from Gemein et al. (2020) and Lawhern et al. (2018) and has 1,408 parameters. Arrows represent layers and operations, while boxes indicate tensor shapes. Refer to Fig. \ref{fig:colors} for an explanation of the color code.
  • Figure 4: Architecture used to train the siNet classifier and encoder, as well as to pretrain the encoder on single frames for later use with the miNet, MINet, and TransNet models. This network has 1,697 parameters.
  • Figure 5: Architecture used by the siNet model for prediction and by the miNet model for both prediction and training. This network has 1,697 parameters. Although three frames are shown for illustration, the model can process an entire recording composed of an arbitrary number of frames.
  • ...and 12 more figures
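
The figure captions above measure recording length in 6-second frames and describe models that process an arbitrary number of such frames. As a rough illustration only, the sketch below cuts a multichannel recording into non-overlapping 6-second frames. The function name `split_into_frames`, the 100 Hz sampling rate, the 19-channel layout, and the choice to drop trailing samples that do not fill a whole frame are all assumptions made for this example; the summary does not state the authors' actual preprocessing.

```python
FRAME_SECONDS = 6  # frame length taken from the figure captions


def split_into_frames(recording, sampling_rate_hz):
    """Cut a recording into non-overlapping 6-second frames.

    `recording` is a list of channels, each a list of samples.
    Returns a list of frames; each frame is a list of channels, and each
    channel holds FRAME_SECONDS * sampling_rate_hz samples.
    Trailing samples that do not fill a whole frame are dropped
    (an assumption of this sketch, not a detail from the paper).
    """
    if not recording:
        return []
    samples_per_frame = int(FRAME_SECONDS * sampling_rate_hz)
    # Use the shortest channel so every frame has the same shape.
    n_samples = min(len(channel) for channel in recording)
    n_frames = n_samples // samples_per_frame
    frames = []
    for i in range(n_frames):
        start = i * samples_per_frame
        stop = start + samples_per_frame
        frames.append([channel[start:stop] for channel in recording])
    return frames


# Hypothetical example: 19 channels, 65 seconds sampled at 100 Hz.
recording = [[float(t) for t in range(6500)] for _ in range(19)]
frames = split_into_frames(recording, sampling_rate_hz=100)
print(len(frames), len(frames[0]), len(frames[0][0]))  # 10 19 600
```

In this example the last 5 seconds are discarded, so the 65-second recording has a length of 10 frames in the sense used by Figure 2.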