PTS-SNN: A Prompt-Tuned Temporal Shift Spiking Neural Networks for Efficient Speech Emotion Recognition

Xun Su, Huamin Wang, Qi Zhang

TL;DR

This work tackles efficient speech emotion recognition on resource-constrained devices by addressing the distribution mismatch between continuous SSL features and discrete spiking dynamics. It presents PTS-SNN, a neuromorphic adaptation framework that freezes a pretrained SSL backbone and couples it with a Temporal Shift Spiking Encoder, Spiking Sparse Linear Attention, and a Context-Aware Membrane Potential Calibration to generate dynamic membrane biases. The approach yields competitive IEMOCAP performance ($WA = 73.34\%$) while using only $1.19$M trainable parameters and $0.35$ mJ per sample, significantly reducing energy and memory overhead relative to state-of-the-art ANN baselines. The results demonstrate the practicality of neuromorphic prompting for edge intelligence in SER, with strong cross-lingual transfer and a clear path to multimodal extensions in future work.

Abstract

Speech Emotion Recognition (SER) is widely deployed in Human-Computer Interaction, yet the high computational cost of conventional models hinders their implementation on resource-constrained edge devices. Spiking Neural Networks (SNNs) offer an energy-efficient alternative due to their event-driven nature; however, their integration with continuous Self-Supervised Learning (SSL) representations is fundamentally challenged by distribution mismatch, where high-dynamic-range embeddings degrade the information coding capacity of threshold-based neurons. To resolve this, we propose Prompt-Tuned Spiking Neural Networks (PTS-SNN), a parameter-efficient neuromorphic adaptation framework that aligns frozen SSL backbones with spiking dynamics. Specifically, we introduce a Temporal Shift Spiking Encoder to capture local temporal dependencies via parameter-free channel shifts, establishing a stable feature basis. To bridge the domain gap, we devise a Context-Aware Membrane Potential Calibration strategy. This mechanism leverages a Spiking Sparse Linear Attention module to aggregate global semantic context into learnable soft prompts, which dynamically regulate the bias voltages of Parametric Leaky Integrate-and-Fire (PLIF) neurons. This regulation effectively centers the heterogeneous input distribution within the responsive firing range, mitigating functional silence or saturation. Extensive experiments on five multilingual datasets (e.g., IEMOCAP, CASIA, EMODB) demonstrate that PTS-SNN achieves 73.34% accuracy on IEMOCAP, comparable to competitive Artificial Neural Networks (ANNs), while requiring only 1.19M trainable parameters and 0.35 mJ inference energy per sample.
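
The silence/saturation problem and the effect of a bias voltage can be illustrated with a toy neuron. The sketch below is not the authors' implementation: it assumes a commonly used PLIF update, $V_t = V_{t-1} + k\,(-(V_{t-1} - V_{reset}) + x_t + b)$ with $k = \mathrm{sigmoid}(w)$, a hard reset, and a hand-picked constant bias `bias` standing in for the prompt-generated bias described in the abstract. The function name and all numeric values are hypothetical.

```python
import numpy as np

def plif_firing_rate(inputs, bias=0.0, w=0.0, v_th=1.0, v_reset=0.0):
    """Fraction of time steps on which a single PLIF-style neuron spikes.

    Assumed update (not taken from the paper): leaky integration with
    k = sigmoid(w) as the learnable inverse time constant, plus a bias
    voltage added to the input current, followed by a hard reset.
    """
    k = 1.0 / (1.0 + np.exp(-w))  # inverse membrane time constant
    v = v_reset
    spikes = 0
    for x in inputs:
        v = v + k * (-(v - v_reset) + x + bias)  # charge with bias
        if v >= v_th:                            # fire
            spikes += 1
            v = v_reset                          # hard reset
    return spikes / len(inputs)

steps = 100
low = np.full(steps, -3.0)   # input far below threshold: neuron stays silent
high = np.full(steps, 10.0)  # input far above threshold: neuron saturates

silent = plif_firing_rate(low)
saturated = plif_firing_rate(high)

# Hand-picked biases that move both inputs to the same effective drive (1.5).
calibrated_low = plif_firing_rate(low, bias=4.5)
calibrated_high = plif_firing_rate(high, bias=-8.5)

print(silent, saturated, calibrated_low, calibrated_high)
```

In this toy setting the uncalibrated inputs give firing rates of 0 and 1, so the spike train carries no information about the input, while both biased cases fire on half of the steps. In PTS-SNN the bias is produced dynamically from the soft prompts rather than fixed by hand.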

Paper Structure

This paper contains 18 sections, 21 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Illustration of the Distribution Mismatch. The raw Upstream Features (dashed grey curve) exhibit a high dynamic range that extends significantly into the Silence or Saturation zones (red regions), causing information loss. In contrast, the proposed PTS-SNN realigns these features into a Calibrated Distribution (solid blue curve) concentrated within the Responsive Range (green region) around the firing threshold $V_{th}$, thereby maximizing the temporal coding capacity of spiking neurons.
  • Figure 2: Overview of the proposed PTS-SNN framework. The architecture processes features from a frozen backbone through three cascaded stages: the Temporal Shift Spiking Encoder for capturing local dependencies via parameter-free shifts, the SSLA module for aggregating global semantic context, and the SNN Backend for emotion classification. The bottom panel details the SSLA mechanism, where learnable soft prompts interact with sparse spike maps to generate dynamic voltage biases for membrane potential calibration.
  • Figure 3: Temporal Shift operation. Left: The original input feature map with dimensions of Time ($T$), Batch ($B$), and Channel ($C$). Right: The transformed feature map after shifting specific channels along the temporal axis. Empty slots at the top are zero-padded, while the overflow at the bottom is truncated to preserve sequence length, facilitating local information exchange without additional parameters.
  • Figure 4: Parameter sensitivity analysis. (a) Accuracy across different prompt lengths $L_p$. (b) Accuracy variations with respect to the bias parameter $\kappa$.
  • Figure 5: Comparison of computational efficiency with state-of-the-art baselines. (a) Number of trainable parameters (log scale). (b) Inference energy consumption per sample (log scale). The proposed PTS-SNN demonstrates orders-of-magnitude improvements in both metrics.
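
The temporal shift described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `temporal_shift`, the fraction of channels that are shifted (`shift_fraction`), and the one-step delay (`step`) are assumptions, since the summary does not state which channels are shifted or by how much.

```python
import numpy as np

def temporal_shift(x: np.ndarray, shift_fraction: float = 0.25, step: int = 1) -> np.ndarray:
    """Delay the first `shift_fraction` of channels by `step` time steps.

    x has shape (T, B, C). Vacated slots at the start of the sequence are
    zero-padded and values pushed past the end are truncated, so the output
    keeps the input shape. The operation has no trainable parameters.
    """
    T, B, C = x.shape
    n_shift = int(C * shift_fraction)  # hypothetical choice of shifted channels
    out = x.copy()
    if n_shift > 0 and step > 0:
        out[:, :, :n_shift] = 0.0      # zero-pad the vacated slots
        if step < T:
            # Move earlier frames into later positions; the overflow is dropped.
            out[step:, :, :n_shift] = x[:T - step, :, :n_shift]
    return out

# Toy input: T=4 time steps, B=1 sample, C=4 channels.
x = np.arange(1, 17, dtype=np.float32).reshape(4, 1, 4)
y = temporal_shift(x, shift_fraction=0.5)
print(y[:, 0, :])
```

After the shift, each frame of the first two channels holds the previous frame's values, so a subsequent per-frame layer sees a mix of current and past features without any added weights.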