Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity

Shogo Noguchi; Taketo Akama; Tai Nakamura; Shun Minamikawa; Natalia Polouliakh

TL;DR

Distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. The results also show that the type of teacher representation shapes downstream performance and that representation learning can be guided by neural encoding.

Abstract

During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.

Paper Structure (8 sections, 32 equations, 9 figures, 1 table)

Introduction
Results
Discussion
Neural network architecture
Training schedule and optimization
Ensemble inference
Evaluation protocol
Implementation details

Figures (9)

Figure 1: Conceptual overview of our approach. To extract representations that are encoded in the cortex, we train EEG models to predict corresponding ANN representations, which are hypothesized to capture stimulus-related dimensions reflected in cortical activity. As a result, the models capture different aspects of neural information: each improves recognition performance individually, and together they complement each other to enhance task-relevant EEG components for song ID classification.
Figure 2: Classification performance across distinct neural representations and multi-seed baselines. Comparison of classification accuracy between individual seed baselines (trained from scratch with seeds 0, 1, and 42) and models pretrained with three distinct ANN representation types: acoustic features, surprisal (16 s context), and entropy (16 s context). All three representation-based models consistently outperformed the seed baselines, demonstrating that pretraining with ANN-derived teacher signals, whether acoustic or predictive, enhances EEG-based song ID classification. The acoustic model achieved the highest single-model accuracy (0.859), followed by surprisal (0.855) and entropy (0.850), each improving over the baseline mean (0.823). This pattern suggests that while all three representation types capture stimulus-related information encoded in EEG, acoustic features provide the strongest supervisory signal for this task.
Figure 3: Single model performance: acoustic representation pretraining vs. non-pretrained baselines. Bar chart comparing classification accuracy between the acoustic feature prediction model and individual seed baselines. The acoustic model achieved 0.859 accuracy, consistently outperforming all three seed-based models: seed 0 (0.832), seed 1 (0.809), and seed 42 (0.827). Statistical annotations display McNemar's test results for each comparison, showing p-values, absolute accuracy improvements, and relative improvement percentages. All comparisons achieved $p<0.001$, confirming that acoustic pretraining provides statistically robust performance gains independent of random initialization.
Figure 4: Context window optimization for prediction-related features. (A) Line plot showing classification accuracy for surprisal-based (orange) and entropy-based (green) models across three context window lengths (8 s, 16 s, 32 s). Both predictive features achieved peak performance at the 16-s window (surprisal: 0.855; entropy: 0.850), with star markers highlighting this optimal configuration. Accuracy declined at both shorter (8 s) and longer (32 s) windows, suggesting that the 16-s window best captures the temporal scope of predictive processing during naturalistic music listening. (B) Heatmap displaying McNemar's test results comparing each predictive model configuration (rows) against three seed baselines (columns). Cell colors indicate statistical outcomes: blue denotes that the predictive model significantly outperformed the seed ($p<0.05$), gray indicates no significant difference, and orange indicates seed superiority. Asterisks within cells denote significance levels ($*$ $p<0.05$, $**$ $p<0.01$, $***$ $p<0.001$). Only the 16-s models achieved statistical superiority over all three seeds for both surprisal and entropy.
Figure 5: Effectiveness of ensembling acoustic and prediction-related features. (A) Bar chart comparing classification accuracy across single models, two-model ensembles, and the three-model ensemble. Accuracy increased systematically with ensemble size: single models ranged from 0.850 to 0.859, two-model combinations achieved 0.879 to 0.881, and the three-model ensemble reached 0.887, demonstrating progressive performance gains through representation integration. (B) Heatmap showing McNemar's test results comparing ensembles (rows) against constituent single models (columns). Blue cells indicate that the ensemble significantly outperformed the single model ($p<0.05$), with asterisks denoting significance levels. All ensembles achieved statistical superiority over their constituent single models, confirming synergistic benefits from combining complementary representations. (C) Heatmap displaying pairwise comparisons between the three-model ensemble (row) and all two-model ensembles (columns). The three-model ensemble significantly outperformed every two-model configuration, indicating that each of the three representation types (acoustic, surprisal, and entropy) captures distinct information that contributes when the models are combined. Statistical notation: $*$ $p<0.05$, $**$ $p<0.01$, $***$ $p<0.001$ (McNemar's test).
...and 4 more figures
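
The captions above report McNemar's test for every pairwise model comparison. As a minimal sketch of how such a comparison can be computed, the Python below implements the exact (binomial) two-sided form of the test on paired per-trial correctness flags. Whether the paper uses the exact form or the chi-square approximation is not stated in this summary, so the variant is an assumption, and the function name `mcnemar_exact` and the toy data are hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correctness flags.

    correct_a[i] and correct_b[i] say whether model A and model B
    classified test trial i correctly. Returns (b, c, p_value), where
    b counts trials only A got right and c counts trials only B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same trials")

    # Only discordant pairs (one model right, the other wrong) are informative.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree

    # Under the null hypothesis, b follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (not from the paper): 10 trials scored for two hypothetical models.
model_a = [True] * 8 + [False] * 2
model_b = [True] * 2 + [False] * 6 + [True] * 2
only_a, only_b, p_value = mcnemar_exact(model_a, model_b)
print(only_a, only_b, p_value)
```

Because the test uses only trials on which the two models disagree, it suits comparisons like those in Figures 3 to 5, where all models are evaluated on the same test trials and overall accuracies differ by a few percentage points.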
