The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin
TL;DR
The paper introduces the Interspeech 2024 Challenge on Speech Processing Using Discrete Units, which defines three tasks to benchmark discrete unit representations of speech: multilingual automatic speech recognition (ASR), text-to-speech (TTS), and singing voice synthesis (SVS). It formalizes discretization and bitrate, and provides baseline systems and evaluation protocols on LibriSpeech and ML-SUPERB for ASR, Expresso for TTS vocoding, LJSpeech for cascaded TTS, and Opencpop for SVS. Preliminary results indicate that discrete units derived from self-supervised learning (SSL) models improve multilingual ASR and SVS, while neural-codec-based approaches yield high-quality TTS output, though often at a higher bitrate, which highlights the trade-off between efficiency and quality. Overall, the challenge establishes a public benchmark suite and offers initial insights into how discrete speech units can unify and accelerate speech processing across tasks, and it points to directions for future research.
Abstract
Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compression and restoration, speech recognition, and speech generation. To foster exploration in this domain, we introduce the Interspeech 2024 Challenge, which focuses on new speech processing benchmarks using discrete units. It encompasses three pivotal tasks, namely multilingual automatic speech recognition, text-to-speech, and singing voice synthesis, and aims to assess the potential applicability of discrete units in these tasks. This paper outlines the challenge designs and baseline descriptions. We also collate baseline and selected submission systems, along with preliminary findings, offering valuable contributions to future research in this evolving field.
