NeuroClean: A Generalized Machine-Learning Approach to Neural Time-Series Conditioning

Manuel A. Hernandez Alonso, Michael Depass, Stephan Quessy, Numa Dancause, Ignasi Cos

TL;DR

This work tackles the challenge of automated, reproducible preprocessing of EEG/LFP time‑series by introducing NeuroClean, an unsupervised five‑step pipeline (bandpass filtering, ZapLine line noise removal, bad‑channel rejection, ICA with Cluster‑MARA, and optional epoching). The pipeline is designed to generalize across diverse experimental setups and to reduce human bias, with validation on high‑dimensional macaque LFP motor data showing substantial improvements in downstream classification metrics and spectral fidelity ($1/f$‑like brain activity). Central to NeuroClean is the Cluster‑MARA method, which uses DBSCAN clustering on MARA features to automatically reject artifactual components without spatial metadata, contributing to robust performance gains. The results suggest NeuroClean as a reproducible, scalable preprocessing foundation for neuroscience and brain‑computer interface research, with potential for broader adoption and further validation across datasets.
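To make the clustering idea behind Cluster-MARA concrete, the minimal sketch below runs DBSCAN on a per-component feature matrix and flags the components that fall outside every dense cluster. It is not the authors' implementation. The rejection rule (a DBSCAN noise label marks a candidate artifact), the function name `flag_outlier_components`, the `eps` and `min_samples` values, and the feature standardization are all assumptions made for this example. The feature matrix itself is assumed to have been computed beforehand.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler


def flag_outlier_components(features, eps=0.8, min_samples=4):
    """Return a boolean mask of components that DBSCAN leaves unclustered.

    features: array of shape (n_components, n_features), one row per
    independent component (hypothetical, precomputed features).
    eps, min_samples: illustrative DBSCAN settings, not values from the paper.
    """
    # Put every feature on a comparable scale before measuring distances.
    scaled = StandardScaler().fit_transform(np.asarray(features, dtype=float))
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
    # DBSCAN assigns the label -1 to points that belong to no dense cluster.
    # Assumption for this sketch: those points are the candidate artifacts.
    return labels == -1
```

The returned mask could then be used to zero out the flagged components before projecting the data back to channel space. Because DBSCAN needs no predefined number of clusters and no electrode positions, a rule of this kind works without spatial metadata, which is the property the summary highlights.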

Abstract

Electroencephalography (EEG) and local field potentials (LFP) are two widely used techniques for recording electrical activity from the brain. These signals serve multiple applications in both the clinical and research domains. However, most brain recordings are contaminated by a myriad of artifacts and noise sources other than the brain itself. A major requirement for their use is therefore proper conditioning which, given current volumes of data, must be fully automated. To this end, we introduce NeuroClean, an unsupervised, multipurpose EEG/LFP preprocessing pipeline. In addition to being complete and reliable, NeuroClean is an unsupervised series of algorithms intended to mitigate the reproducibility issues and biases caused by human intervention. The pipeline is designed as a five-step process that includes the common bandpass filtering, line noise filtering, and bad channel rejection. It additionally incorporates an efficient independent component analysis with automatic component rejection based on a clustering algorithm. A machine learning classifier is used to check that task-relevant information is preserved after each step of the cleaning process. We used several data sets to validate the pipeline. NeuroClean removed several common types of artifacts from the signal. Moreover, in the context of motor tasks of varying complexity, an optimized multinomial logistic regression model reached more than 97% accuracy on the cleaned data (chance level: 33.3%), compared with 74% on the raw data. These results suggest that NeuroClean is a promising pipeline and workflow that future studies can apply to improve the generalization and performance of machine learning pipelines.
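The evaluation described above, a multinomial logistic regression scored against a three-class chance level, can be sketched roughly as follows. This is an illustrative setup rather than the authors' code. The helper names `split_accuracies` and `chance_level`, the stratified 80/20 splits, the feature standardization, and the solver settings are assumptions, and the feature matrix `X` and labels `y` are hypothetical inputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def chance_level(y):
    """Chance accuracy for balanced classes: 1 / number of classes."""
    return 1.0 / len(np.unique(y))


def split_accuracies(X, y, n_splits=100, test_size=0.2, seed=0):
    """Held-out accuracy of a logistic regression over repeated random splits.

    X: array (n_trials, n_features); y: array (n_trials,) of class labels.
    The split scheme and test fraction are assumptions for this sketch.
    """
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=seed
    )
    # With its default lbfgs solver, scikit-learn fits a multinomial
    # (softmax) model when the target has more than two classes.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = []
    for train_idx, test_idx in splitter.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))
    return np.asarray(scores)
```

With three balanced motor states, `chance_level` gives 1/3, which matches the 33.3% quoted in the abstract. Running `split_accuracies` once on features from raw data and once on features from cleaned data would give two accuracy distributions that can be compared directly.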

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures.

Figures (3)

  • Figure 1: Schematic of the preprocessing pipeline. The data are first bandpass filtered from 1 Hz to 500 Hz, then passed through a ZapLine filter to remove power supply artifacts and their harmonics. A bad channel rejection algorithm is applied next, followed by ICA with ClusterMARA to reject components. Finally, the data are epoched to obtain structured, processed data.
  • Figure 2: Summary schematic of the ClusterMARA algorithm, a modification of the MARA algorithm [maraalgo].
  • Figure 3: Motor-state classification performance results. Three states were defined and classified using a multinomial logistic regression (MLR) model. A: Distribution of training accuracies across one hundred train-test splits for the MLR classifier applied to the data before any preprocessing, for all frequency bands and the full dataset. B: Same as A, but for the data after the NeuroClean pipeline was applied. C: Normalized confusion matrices for each frequency band for the raw, unprocessed data. D: Same as C, but for the data after the NeuroClean pipeline was applied. E: Overall distribution of accuracies after each step of the NeuroClean pipeline (raw unprocessed data; bandpassed only; bandpassed and ZapLine filtered; bandpassed, ZapLine filtered, and with bad channel rejection; fully preprocessed). F: Receiver operating characteristic curves for the MLR classifier for all frequency bands, computed as the micro-averaged One-vs-Rest curve for each step of the NeuroClean pipeline.
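
The first stage in the Figure 1 schematic, a 1 Hz to 500 Hz bandpass, could be sketched with SciPy as follows. The source gives only the band edges. The filter family (Butterworth), the order, the zero-phase application, and the function name `bandpass` are assumptions made for this example.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def bandpass(data, fs, low=1.0, high=500.0, order=4):
    """Zero-phase Butterworth bandpass along the last (time) axis.

    data: array (..., n_samples); fs: sampling rate in Hz.
    low, high: band edges in Hz (1 and 500 Hz per the Figure 1 caption).
    order: illustrative filter order, not a value taken from the paper.
    """
    if high >= fs / 2:
        raise ValueError("upper band edge must lie below the Nyquist frequency fs/2")
    # Second-order sections stay numerically stable for a low edge this
    # close to 0 Hz relative to the sampling rate.
    sos = butter(order, [low, high], btype="bandpass", fs=fs, output="sos")
    # Forward-backward filtering cancels the phase delay of a single pass.
    return sosfiltfilt(sos, np.asarray(data, dtype=float), axis=-1)
```

For a hypothetical array `lfp` of shape (channels, samples) sampled at 2000 Hz, `bandpass(lfp, fs=2000.0)` returns an array of the same shape. A 500 Hz upper edge requires a sampling rate above 1000 Hz, which is why the function rejects lower rates.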