SpikeVoice: High-Quality Text-to-Speech Via Efficient Spiking Neural Network

Kexin Wang, Jiahong Zhang, Yong Ren, Man Yao, Di Shang, Bo Xu, Guoqi Li

TL;DR

SpikeVoice introduces Spiking Temporal-Sequential Attention (STSA) to overcome the partial-time dependency of sequence modeling in spiking neurons, enabling high-quality TTS within an SNN. The architecture combines a Spiking Phoneme Encoder, Spiking Variance Adaptor, and Spiking Mel Decoder in a spike-driven, non-autoregressive pipeline, achieving near-ANN speech quality on English and Chinese datasets while consuming only about 10% of ANN energy. Extensive experiments on single- and multi-speaker corpora demonstrate competitive objective metrics and strong subjective MOS, with clear energy-efficiency advantages and insightful visual analyses of spike behavior. This work demonstrates the practicality of energy-efficient TTS in the SNN paradigm and highlights future directions for reducing information loss and accelerating training in spike-based generative models.

Abstract

Brain-inspired Spiking Neural Networks (SNNs) have demonstrated their effectiveness and efficiency in vision, natural language, and speech understanding tasks, indicating their capacity to "see", "listen", and "read". In this paper, we design SpikeVoice, which performs high-quality Text-To-Speech (TTS) via SNN, to explore the potential of SNN to "speak". A major obstacle to using SNN for such generative tasks lies in the demand for models to grasp long-term dependencies. The serial nature of spiking neurons, however, makes information at future spiking time steps invisible, limiting SNN models to capturing sequence dependencies solely within the same time step. We term this phenomenon "partial-time dependency". To address this issue, we introduce Spiking Temporal-Sequential Attention (STSA) in SpikeVoice. To the best of our knowledge, SpikeVoice is the first TTS work in the SNN field. We perform experiments on four well-established datasets that cover both Chinese and English, in both single-speaker and multi-speaker configurations. The results demonstrate that SpikeVoice can achieve results comparable to Artificial Neural Networks (ANN) with only 10.5% of the energy consumption of ANN.
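The "partial-time dependency" described in the abstract can be illustrated with a minimal NumPy sketch. It is not the SpikeVoice implementation: the tensor layout `(T, L, D)` (spiking time steps, sequence length, feature size), the function names `lif_layer`, `sequential_attention`, and `temporal_attention`, the hard-reset LIF update, and the softmax-free attention form are all assumptions made for illustration. The sketch shows that a serial LIF loop and per-time-step attention never see later time steps, while attention along the time axis does.

```python
import numpy as np

def lif_layer(currents, threshold=1.0, decay=0.5):
    """Run a leaky integrate-and-fire layer over T time steps.

    currents: float array of shape (T, L, D), input current per time step.
    Returns a binary spike tensor of the same shape.
    """
    T = currents.shape[0]
    membrane = np.zeros_like(currents[0])
    spikes = np.zeros_like(currents)
    for t in range(T):  # serial loop: step t never sees steps later than t
        membrane = decay * membrane + currents[t]
        fired = membrane >= threshold
        spikes[t] = fired.astype(currents.dtype)
        membrane = np.where(fired, 0.0, membrane)  # hard reset after a spike
    return spikes

def sequential_attention(q, k, v, scale=1.0):
    """Attention over the sequence axis L, separately at each time step.

    Softmax-free form (an assumption for this sketch).
    """
    scores = np.einsum("tld,tmd->tlm", q, k) * scale  # (T, L, L)
    return np.einsum("tlm,tmd->tld", scores, v)       # (T, L, D)

def temporal_attention(q, k, v, scale=1.0):
    """Attention over the time axis T, separately at each sequence position."""
    scores = np.einsum("tld,sld->lts", q, k) * scale  # (L, T, T)
    return np.einsum("lts,sld->tld", scores, v)       # (T, L, D)
```

In this sketch, changing the input at the last time step leaves the output of `sequential_attention` at earlier steps untouched, whereas `temporal_attention` mixes information across all time steps. Combining the two axes is the general idea the abstract attributes to STSA; how the paper actually combines them is not specified here.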

Paper Structure

This paper contains 18 sections, 7 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Overview of the SpikeVoice model structure. The left part shows the Spiking Temporal-Sequential Attention (STSA). The middle part shows, from bottom to top, the Spiking Phoneme Encoder (SPE), Spiking Variance Adapter (SVA), and Spiking Mel Decoder (SMD), with the output Mel-Spectrogram at the top. In the right part, the green module is the predictor within the Spiking Variance Adapter, the blue module is the Spiking FeedForward, and the orange module is the Spiking PostNet.
  • Figure 2: The LIF neuron layer.
  • Figure 3: Mel-Spectrogram visualization analysis on the English single-speaker dataset LJSpeech.
  • Figure 4: Visualization of spike tensors. The first two panels show the spike patterns of STSA in the first layer and the fourth layer. The last two panels show the spike patterns for speech energy and speech pitch. Each dot depicts a firing event.
  • Figure 5: Visualization of spike tensors in SpikeVoice. The four encoder panels show the spike patterns of STSA in the Spiking Phoneme Encoder. The energy and pitch output panels show the spike patterns for speech energy and speech pitch. The six decoder panels show the spike patterns of STSA in the Spiking Mel Decoder.
  • ...and 1 more figure