S$^2$M-Former: Spiking Symmetric Mixing Branchformer for Brain Auditory Attention Detection
Jiaqi Wang, Zhengyu Ma, Xiongri Shen, Chenlin Zhou, Leilei Zhao, Han Zhang, Yi Zhong, Siqi Cai, Zhenxi Song, Zhiguo Zhang
TL;DR
S$^2$M-Former tackles EEG-based auditory attention detection (AAD) under strict energy constraints with a spike-driven, symmetric two-branch architecture that processes spatial and frequency features as lightweight 1D token sequences. The model combines branch-specific spiking encoders with stacked S$^2$M blocks, each comprising SCSA, SMSC, SGCM, and MPTM modules, to enable complementary learning across branches and robust fusion. It uses up to $14.7\times$ fewer parameters and $5.8\times$ less energy than recent ANN methods while keeping accuracy competitive with the state of the art (SOTA) on KUL, DTU, and AV-GC-AAD in within-trial, cross-trial, and cross-subject settings, which indicates strong generalization and suitability for neuromorphic AAD. The work highlights the practical potential of energy-efficient, brain-inspired AAD for neuro-steered hearing devices and lays groundwork for implementations on neuromorphic hardware.
Abstract
Auditory attention detection (AAD) aims to decode a listener's focus in complex auditory environments from electroencephalography (EEG) recordings, which is crucial for developing neuro-steered hearing devices. Despite recent advances, EEG-based AAD remains hindered by the absence of synergistic frameworks that can fully leverage complementary EEG features under energy-efficiency constraints. We propose S$^2$M-Former, a novel spiking symmetric mixing framework that addresses this limitation through two key innovations: (i) a spike-driven symmetric architecture composed of parallel spatial and frequency branches with a mirrored modular design, which leverages biologically plausible token-channel mixers to enhance complementary learning across branches; and (ii) lightweight 1D token sequences that replace conventional 3D operations, reducing parameters by 14.7$\times$. The brain-inspired spiking architecture further reduces power consumption, achieving a 5.8$\times$ energy reduction compared to recent ANN methods, while also surpassing existing SNN baselines in parameter efficiency and performance. Comprehensive experiments on three AAD benchmarks (KUL, DTU, and AV-GC-AAD) across three settings (within-trial, cross-trial, and cross-subject) demonstrate that S$^2$M-Former achieves decoding accuracy comparable to the state of the art (SOTA), making it a promising low-power, high-performance solution for AAD tasks. Code is available at https://github.com/JackieWang9811/S2M-Former.
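To make the term "spike-driven" concrete, the following is a minimal sketch of a single leaky integrate-and-fire (LIF) neuron, a common building block of spiking networks. It is an illustration only: the abstract does not state which neuron model S$^2$M-Former uses, and the function name `lif_forward` and the parameters `tau`, `v_threshold`, and `v_reset` are assumptions made for this example, not taken from the released code.

```python
def lif_forward(currents, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Run one LIF neuron over a sequence of input currents.

    Hypothetical helper for illustration; not the paper's implementation.
    Returns a list of binary spikes (0 or 1), one per time step.
    """
    v = v_reset  # membrane potential starts at the reset value
    spikes = []
    for x in currents:
        # Leaky integration: the potential decays toward v_reset
        # while being driven by the input current x.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)
            v = v_reset  # hard reset after the neuron fires
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    # A constant drive of 1.5 for three steps, then silence.
    print(lif_forward([1.5, 1.5, 1.5, 0.0, 0.0]))
```

Because the outputs are binary, a downstream layer only needs to add the weights of the inputs that spiked rather than multiply every input by its weight, which is the usual argument for the lower energy cost of spiking networks on neuromorphic hardware.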
