Measuring Robustness of Speech Recognition from MEG Signals Under Distribution Shift

Sheng-You Chien, Bo-Yi Mao, Yi-Ning Chang, Po-Chih Kuo

Abstract

This study investigates robust speech-related decoding from non-invasive MEG signals using the LibriBrain phoneme-classification benchmark from the 2025 PNPL competition. We compare residual convolutional neural networks (CNNs), an STFT-based CNN, and a CNN--Transformer hybrid, while also examining the effects of group averaging, label balancing, repeated grouping, normalization strategies, and data augmentation. Across our in-house implementations, preprocessing and data-configuration choices matter more than additional architectural complexity, with instance normalization emerging as the most influential modification for generalization. The strongest of our own models, a CNN with group averaging, label balancing, repeated grouping, and instance normalization, achieves 60.95% F1-macro on the test split, compared with 39.53% for the plain CNN baseline. However, most of our models trained without instance normalization show substantial validation-to-test degradation, indicating that distribution shift induced by differing normalization statistics is a major obstacle to generalization in our experiments. By contrast, MEGConformer maintains 64.09% F1-macro on both validation and test, and saliency-map analysis is qualitatively consistent with this contrast: weaker models exhibit more concentrated or repetitive phoneme-sensitive patterns across splits, whereas MEGConformer appears more distributed. Overall, the results suggest that improving the reliability of non-invasive phoneme decoding will likely require better handling of normalization-related distribution shift while also addressing the challenge of single-trial decoding.
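The abstract identifies per-trial instance normalization as the most influential change for generalization: by standardizing each MEG segment with its own statistics, the model no longer depends on split-level normalization constants that differ between validation and test. A minimal sketch of this idea, assuming a (channels, time) segment layout and a per-channel, per-trial scheme (the paper's exact axes and epsilon are assumptions):

```python
import numpy as np

def instance_normalize(x, eps=1e-6):
    """Per-trial instance normalization of one MEG segment.

    x: array of shape (channels, time). Each channel is standardized
    using statistics from this segment alone, so the transform is
    identical at train and test time and cannot carry split-level
    distribution shift. Hypothetical sketch, not the paper's code.
    """
    mean = x.mean(axis=-1, keepdims=True)   # per-channel mean over time
    std = x.std(axis=-1, keepdims=True)     # per-channel std over time
    return (x - mean) / (std + eps)

# Example: a synthetic 306-channel, 200-sample MEG segment with an
# arbitrary offset and scale, as might differ between recording splits.
segment = np.random.randn(306, 200) * 5.0 + 2.0
normed = instance_normalize(segment)
```

After normalization, each channel of `normed` has approximately zero mean and unit variance regardless of the segment's original offset and scale, which is the property the abstract credits for robustness under distribution shift.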

Paper Structure

This paper contains 50 sections, 5 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Raw phoneme-class distribution for the train and validation splits.
  • Figure 2: Overall model architecture. (a) CNN backbone. (b) STFT-CNN. (c) CNN-Transformer hybrid. For all convolutional blocks, the stride $S$ is set to $1$, and padding $P$ is chosen to preserve the temporal resolution unless otherwise specified.
  • Figure 3: Row-wise normalized layer saliency maps for MEGConformer across validation and test splits. The saliency is visually stable across splits and distributed across encoder blocks, matching the strong validation--test transfer reported in Table \ref{tab:summary_results}.
  • Figure 4: Row-wise normalized layer saliency maps for the three models for which both standard and InstanceNorm variants are available. In each 3-column block, the upper row shows validation maps and the lower row shows test maps. Each row within a saliency map corresponds to a trainable sublayer, and each column corresponds to a phoneme class. The color intensity indicates the relative saliency of each phoneme for that layer, normalized within the row.
  • Figure 5: Layer-by-phoneme validation-test saliency similarity matrices for MEGConformer. Both metrics show near-uniformly high correspondence across layers and phoneme classes.
  • ...and 1 more figure