Synthetic Data Generation for Classifying Electrophysiological and Morpho-Electrophysiological Neurons from Mouse Visual Cortex

Xavier Vasques, Laura Cif

TL;DR

This study benchmarks classical and deep generative augmentation methods for classifying Allen electrophysiology-defined e-types in the mouse visual cortex, comparing E→e-type and M+E→e-type tasks. It finds that SMOTE offers the most robust gains, especially when augmentation is applied in the native high-dimensional feature space, with hold-out accuracy rising to roughly 0.72–0.76 for E→e-type and 0.85–0.90 for M+E→e-type; deep generative models provide moderate, context-dependent improvements. A biologically anchored fidelity framework using KS tests, MAE, Euclidean distances, and a Mann–Whitney variability baseline shows that SMOTE-generated samples fall within biologically plausible diversity, while also revealing persistent challenges for rare/inhibitory subclasses. The results give practical guidance for scalable neuron-type classification and point to future work on reduction-aware generative models and targeted data collection to improve fidelity for hard cases. Overall, the work supports synthetic augmentation as a complementary tool to multimodal neuronal mapping, enabling more robust classification while preserving biological interpretability.

Abstract

The accurate classification of neuronal cell types is central to decoding brain function, yet remains hindered by data scarcity and cellular heterogeneity. Here, we benchmarked classical and deep generative synthetic data augmentation strategies -- including SMOTE, GANs, VAEs, Normalizing Flows, and DDPMs -- for supervised classification of both electrophysiological (e-type) and morpho-electrophysiological (mee-type) neuron types from the mouse visual cortex. Using a curated dataset annotated with 48 electrophysiological and 24 morphological features, we established baseline classifiers and introduced synthetic data generated by each method. Our results demonstrate that SMOTE-based augmentation yields the highest classification accuracies (absolute gains of 0.16 for e-types, 0.12 for mee-types), outperforming deep generative models. GANs approached similar performance when hyperparameters and sample sizes were optimized, but were more sensitive to model specification. In addition, we benchmarked synthetic neuron fidelity by comparing mean absolute errors between synthetic and real class profiles against the natural phenotypic variability observed between real neuronal classes.
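
To make the SMOTE step concrete, the following is a minimal sketch of its core idea: each synthetic neuron is a random interpolation between a real neuron and one of its nearest neighbors from the same class. It is not the authors' implementation. The function name `smote_oversample`, the default `k=5`, and the toy data are illustrative assumptions, and a real pipeline would more likely call an established library than hand-rolled code.

```python
import math
import random


def smote_oversample(samples, n_synthetic, k=5, seed=0):
    """Generate synthetic feature vectors for ONE class by SMOTE-style interpolation.

    samples      -- list of equal-length feature vectors (lists of floats), all from
                    the same class (hypothetical input layout, assumed pre-scaled)
    n_synthetic  -- number of synthetic vectors to return
    k            -- number of nearest neighbors considered per real sample
    """
    if len(samples) < 2:
        raise ValueError("need at least two real samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(samples) - 1)

    # For every real sample, store the indices of its k nearest neighbors
    # (Euclidean distance in the feature space passed in).
    neighbors = []
    for i, x in enumerate(samples):
        others = [j for j in range(len(samples)) if j != i]
        others.sort(key=lambda j: math.dist(x, samples[j]))
        neighbors.append(others[:k])

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(samples))      # pick a real sample
        j = rng.choice(neighbors[i])         # pick one of its neighbors
        gap = rng.random()                   # position along the connecting segment
        base, neigh = samples[i], samples[j]
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neigh)])
    return synthetic


# Toy example: five "neurons" with two features each, oversampled to 20 synthetic points.
real = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
synthetic = smote_oversample(real, n_synthetic=20, k=3, seed=1)
```

Because every synthetic point lies on a segment between two real samples of the same class, it cannot leave the range spanned by that class. This is one plausible reason interpolation-based samples stay within observed variability. To mirror the evaluation described above, such oversampling would be applied to the training split only, with the real test set left untouched.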

Paper Structure

This paper contains 19 sections, 13 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Overview of datasets, preprocessing, and synthetic feature generation for e-type classification. (A) Neuron collection and label space. The E dataset comprises 1,857 neurons with 48 intrinsic electrophysiological features; the M+E dataset comprises 451 neurons with both electrophysiological (n=48) and morphology-derived (n=24) features. In all analyses, the labels are the 17 electrophysiology-defined neuron classes (e-types) from the Allen Cell Types Database (4 excitatory classes, Exc_1--Exc_4; 13 inhibitory classes, Inh_1--Inh_13), as in Gouwens et al. (2019). Excitatory vs inhibitory counts are indicated for each dataset. Morphology is used only as additional input in the M+E$\to$e-type task; the label space is identical in both settings. (B) Preprocessing and feature engineering. For each dataset, we applied explicit feature curation to remove strictly technical variables, followed by a fixed ColumnTransformer that performs family-specific scaling (RobustScaler for timing features, StandardScaler for voltage and other continuous variables, no rescaling for ratios/indices, and StandardScaler for morphology metrics), and an optional dimensionality reduction step (PCA, ICA, TruncatedSVD, Isomap, LLE, NCA, or no reduction). Combinations of scaler + reducer define the preprocessing pipelines evaluated downstream. (C) Classification baselines and synthetic feature generation. Each preprocessing pipeline was combined with multiple supervised classifiers (SVM, multilayer perceptron, decision tree, random forest, Gaussian Naïve Bayes, linear discriminant analysis, logistic regression) and evaluated by stratified cross-validation to select robust baselines for E$\to$e-type and M+E$\to$e-type. Preprocessing + classifier configurations were then reused to train models on augmented training sets in which real neurons were supplemented with synthetic samples generated by SMOTE, DDPM, normalizing flows, GANs, or VAEs.
    In all cases, final performance was assessed on the same held-out real test set, allowing us to quantify the impact of synthetic data on the recovery of e-type classes.
  • Figure 2: Example neurons, features, and class structure used for supervised classification. (A) Representative excitatory and inhibitory neurons from the Allen Cell Types Database with both electrophysiological and morphological recordings. Left: spiny pyramidal cell from primary visual cortex layer 4 (VISp4, e-type Exc_3), shown as a projected image stack (left panel) and 3D reconstruction (right panel; soma in gray, dendrites in red, axon in blue). Right: aspiny interneuron from VISp layer 2/3 (VISp2/3, e-type Inh_1) shown with the same visualizations. These examples illustrate the contrasting dendritic and axonal arborizations of excitatory versus inhibitory neurons in mouse visual cortex. (B) Subset of electrophysiological and morphological features used as model inputs for the two example cells in panel A. For each neuron, we report key intrinsic properties derived from standardized current-clamp protocols (e.g. resting membrane potential vrest, input resistance, sag ratio, membrane time constant $\tau$, mean inter-spike interval avg_isi, and spike-frequency adaptation index), together with summary morphology metrics (e.g. maximal Euclidean path length from soma, number of primary stems and bifurcations, average contraction, and parent--daughter diameter ratio). These features are directly drawn from the Allen Cell Types feature matrix and form part of the 48 electrophysiological and 24 morphological variables used throughout the study. (C) Electrophysiological class labels (e-types) used as prediction targets. The 17 classes comprise 4 excitatory clusters (Exc_1--Exc_4) and 13 inhibitory clusters (Inh_1--Inh_13), each summarized by a plain-English descriptor of its firing pattern and intrinsic properties (e.g. regular spiking transient/adapting, irregular spiking, fast-spiking sustained, FS delay/pause). 
    The right columns report the number of neurons per class in the training and test sets for the electrophysiology-only (E$\to$e-type) and morpho-electrophysiology (M+E$\to$e-type) datasets.
  • Figure 3: Baseline performance landscape and learning curves for neuronal subtype classification.
  • Figure 4: Hold-out accuracy for neuronal subtype classification with synthetic data augmentation. Performance is shown for electrophysiology-only neurons (E$\to$e-type, top row) and multimodal morphology+electrophysiology neurons (M+E$\to$e-type, bottom row), across increasing amounts of synthetic data (N = 0, 100, 500, 1000, 5000, 10000). Left panels display results obtained in the native feature space without dimensionality reduction; right panels show results when PCA, ICA, SVD, or NCA was applied before classification. Each curve corresponds to one augmentation method (SMOTE, VAE, GAN, Normalizing Flow, DDPM).
  • Figure 5: Merged feature-wise kernel density estimates comparing real training data (blue), real test data (red), and synthetic neurons generated with SMOTE at 5,000 samples per class (green) for both electrophysiological (E$\to$e-type; Panel A) and morpho-electrophysiological datasets (M+E$\to$e-type; Panel B). Each subplot represents the empirical distribution of a single electrophysiological or morphological feature.
  • ...and 1 more figures
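
As a numeric complement to the density comparison in Figure 5, the MAE-based fidelity check mentioned in the abstract can be sketched as follows. The mean absolute error between each synthetic class profile and its real counterpart is compared against the MAE observed between pairs of real classes. This is an illustrative simplification, not the paper's procedure: the function names, the dictionary layout, and the use of the mean between-class MAE as the baseline are assumptions, and the KS and Mann–Whitney tests from the full framework are not reproduced here.

```python
from itertools import combinations
from statistics import fmean


def class_profile(rows):
    """Per-feature mean of a list of feature vectors (one class)."""
    return [fmean(col) for col in zip(*rows)]


def mae(a, b):
    """Mean absolute error between two equal-length profiles."""
    return fmean([abs(x - y) for x, y in zip(a, b)])


def fidelity_report(real_by_class, synthetic_by_class):
    """Compare synthetic-vs-real MAE per class with real between-class MAE.

    Both arguments map a class label to a list of feature vectors
    (hypothetical layout; features are assumed to be on comparable scales).
    Requires at least two real classes so a between-class baseline exists.
    Returns (baseline, {label: (mae_to_real_profile, within_baseline)}).
    """
    real_profiles = {c: class_profile(rows) for c, rows in real_by_class.items()}

    # Natural variability: MAE between every pair of real class profiles.
    between = [
        mae(real_profiles[a], real_profiles[b])
        for a, b in combinations(sorted(real_profiles), 2)
    ]
    baseline = fmean(between)  # assumed summary statistic for the baseline

    report = {}
    for label, rows in synthetic_by_class.items():
        err = mae(class_profile(rows), real_profiles[label])
        report[label] = (err, err <= baseline)
    return baseline, report


# Toy example with two classes and two features per neuron.
real = {
    "Exc_1": [[0.0, 0.0], [2.0, 2.0]],
    "Inh_1": [[10.0, 10.0], [12.0, 12.0]],
}
synthetic = {
    "Exc_1": [[1.0, 1.5], [1.0, 1.5]],
    "Inh_1": [[11.0, 11.0]],
}
baseline, report = fidelity_report(real, synthetic)
```

A synthetic class whose error falls well below the between-class baseline deviates from its real profile by less than real classes differ from one another, which is the sense in which the samples can be called biologically plausible.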