TouchASP: Elastic Automatic Speech Perception that Everyone Can Touch

Xingchen Song, Chengdong Liang, Binbin Zhang, Pengshen Zhang, ZiYu Wang, Youcheng Ma, Menglong Xu, Lin Wang, Di Wu, Fuping Pan, Dinghao Zhou, Zhendong Peng

TL;DR

TouchASP tackles the cost and rigidity of large ASR models by combining an elastic mixture-of-experts (eMoE) model, a scalable web-data pipeline, and a multi-task Automatic Speech Perception (ASP) framework. The model is trained once and then scaled elastically at inference time to fit devices with varying resources. Trained on 1M hours of data, it reduces the Character Error Rate (CER) on the SpeechIO test sets from $4.98\%$ to $2.45\%$. The approach extends beyond ASR to multilingual, dialect, emotion, gender, and 70+ sound-event perception, with results reported for language identification, speech emotion recognition (SER), and sound event detection (SED). This makes it relevant to edge deployment and universal audio understanding. Limitations include reliance on encoder-decoder architectures and predefined tasks; future work targets integration with multilingual LLMs and end-to-end interaction models such as TouchChat or TouchTTS.

Abstract

Large Automatic Speech Recognition (ASR) models demand a vast number of parameters, copious amounts of data, and significant computational resources during training. However, such models can only be deployed on high-compute cloud platforms and can only perform speech recognition, which leads to high costs and restricted capabilities. In this report, we first propose the elastic mixture-of-experts (eMoE) model, which can be trained just once and then elastically scaled according to deployment requirements. Second, we devise an unsupervised data creation and validation procedure and gather millions of hours of audio data from diverse domains for training. Using these two techniques, our system achieves elastic deployment capabilities while reducing the Character Error Rate (CER) on the SpeechIO test sets from 4.98\% to 2.45\%. Third, our model is not only competent in Mandarin speech recognition but also proficient in multilingual, multi-dialect, emotion, gender, and sound event perception. We refer to this as Automatic Speech Perception (ASP), and the perception results are presented in the experimental section.
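
For readers unfamiliar with the metric, the sketch below shows the standard definition of CER (character-level edit distance divided by reference length) and the relative reduction implied by the reported figures. It is a minimal illustration, not the authors' scoring code: the function names are hypothetical, and any text normalization applied before scoring on SpeechIO is not described here and is therefore omitted.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate: edit distance over reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # One substituted character out of six reference characters.
    print(cer("今天天气很好", "今天天汽很好"))
    # Relative CER reduction implied by 4.98% -> 2.45%.
    print(round(relative_reduction(4.98, 2.45), 3))
```

Under this definition, going from 4.98\% to 2.45\% corresponds to removing roughly half of the baseline character errors.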

Paper Structure

This paper contains 11 sections, 5 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: TouchASP: Train once, Elastic inference with multiple speech perception capabilities
  • Figure 2: Illustration of Elastic MoE (eMoE). Subfigure (a) shows conventional top-K routing ($K=2$) among $N$ experts ($N=8$). Subfigure (b) illustrates the fine-grained expert segmentation and shared expert isolation strategies originally proposed in DeepSeekMoE, with one shared expert ($S=1$) as an example. Subfigure (c) shows the integration of DeepSeekMoE with our dynamic training approach, which constitutes the complete eMoE architecture. Across these three architectures, the number of expert parameters and the computational cost remain constant.
  • Figure 3: Overview of Our Data Pipeline
  • Figure 4: Overview of model structure and multi-task training. The encoder-decoder model is trained on many different speech tasks. All these tasks are represented by a token sequence predicted by the decoder, allowing a single model to replace different stages in the traditional speech processing pipeline. The multi-task training format uses a set of special tokens as task instructions or classification targets, as explained in Section \ref{sec:gasp}.
  • Figure 5: CER on typical SpeechIO test sets as the training data scales. SPEECHIO_ASR_ZH00004 is an easy talk test set, SPEECHIO_ASR_ZH00010 is a medium-difficulty interview test set, and SPEECHIO_ASR_ZH00015 is a difficult story-imitation test set.
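
The routing schemes named in the Figure 2 caption can be sketched in a few lines. The code below is an illustrative toy, not the eMoE implementation: all function names are hypothetical, the segmentation factor `m = 2` is an assumed value (the caption gives only $N=8$, $K=2$, $S=1$), renormalizing the top-K weights is an assumption, and the dynamic training approach of subfigure (c) is not covered because it is not described in this section.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs. Weights are renormalized to
    sum to 1 over the selected experts (an assumption of this sketch).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def expert_budget(n, k, m, s):
    """Expert counts after fine-grained segmentation and shared isolation.

    Each of the n experts is split into m smaller ones, and s of the
    resulting experts are set aside as always-active shared experts.
    The number of routed experts activated per token is reduced by s,
    so k * m small experts run in total, matching the cost of k full ones.
    """
    total = n * m
    return {"routed": total - s, "shared": s, "active_routed": k * m - s}


def moe_forward(x, routed_experts, shared_experts, router_logits, k):
    """Sum always-on shared experts with k weighted routed experts."""
    out = [0.0] * len(x)
    for expert in shared_experts:
        out = [o + v for o, v in zip(out, expert(x))]
    for idx, weight in top_k_route(router_logits, k):
        out = [o + weight * v for o, v in zip(out, routed_experts[idx](x))]
    return out


if __name__ == "__main__":
    # Caption values N=8, K=2, S=1 with the assumed split factor m=2.
    print(expert_budget(n=8, k=2, m=2, s=1))
    # Toy experts: each routed expert scales its input by a constant.
    routed = [lambda x, s=s: [s * v for v in x] for s in range(4)]
    shared = [lambda x: list(x)]
    print(moe_forward([1.0, 2.0], routed, shared, [0.0, 2.0, 1.0, -1.0], k=2))
```

With the caption's values and the assumed split factor, `expert_budget` yields 15 routed experts, 1 shared expert, and 3 routed experts active per token, which is consistent with the caption's statement that parameter count and compute stay constant across the three layouts.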