MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech

Shengpeng Ji, Ziyue Jiang, Hanting Wang, Jialong Zuo, Zhou Zhao

TL;DR

MobileSpeech introduces the first zero-shot TTS system designed for mobile deployment, addressing the need for fast, lightweight, and robust voice cloning of unseen speakers. It pairs a discrete acoustic codec with a Speech Codec Mask Decoder (SMD) and a Speaker Prompt module, so that codec tokens are generated in parallel by masked prediction across codec channels rather than one token at a time, which significantly reduces latency compared with autoregressive approaches. The model combines text and prompt speech through cross-attention and extracts fine-grained duration prompts from the prompt speech; it is trained on multi-channel RVQ tokens with masked generation objectives. On LibriSpeech and Mandarin datasets, MobileSpeech reports state-of-the-art inference speed (a real-time factor, RTF, of about 0.09 on a single A100 GPU) together with strong speech quality and speaker similarity, and it has been deployed on mobile devices.
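The parallel, masked generation idea can be illustrated with a toy decoding loop. The sketch below is a generic iterative masked-decoding scheme with a cosine unmasking schedule, which is an assumption made for illustration and not confirmed as MobileSpeech's exact procedure. `toy_predict` is a hypothetical stand-in for a trained decoder and simply returns random tokens and confidences.

```python
import math
import random

MASK = -1  # sentinel for a position that has not been generated yet


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for a trained decoder.

    Returns one (token, confidence) pair per position, all in one pass.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_parallel_decode(length, vocab_size=1024, steps=4, seed=0):
    """Fill `length` token slots in `steps` parallel passes instead of `length` passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # Assumed cosine schedule: how many positions remain masked after this pass.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        preds = toy_predict(tokens, vocab_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the most confident predictions and leave the rest masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        n_commit = max(0, len(masked) - keep_masked)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


result = masked_parallel_decode(length=20, steps=4)
print(result)
```

With 20 positions and 4 passes, every slot is filled after the fourth pass, whereas an autoregressive decoder would need 20 sequential passes for the same sequence. That difference in the number of sequential model calls is the source of the latency reduction.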

Abstract

Zero-shot text-to-speech (TTS) has gained significant attention due to its powerful voice cloning capabilities, requiring only a few seconds of voice prompts from an unseen speaker. However, all previous work has been developed for cloud-based systems. Taking autoregressive models as an example, although these approaches achieve high-fidelity voice cloning, they fall short in terms of inference speed, model size, and robustness. Therefore, we propose MobileSpeech, the first fast, lightweight, and robust zero-shot text-to-speech system designed for mobile devices. Specifically: 1) leveraging a discrete codec, we design a parallel speech mask decoder module called SMD, which incorporates hierarchical information from the speech codec and weight mechanisms across different codec layers during the generation process. Moreover, to bridge the gap between text and speech, we introduce a high-level probabilistic mask that simulates the progression of information flow from less to more during speech generation. 2) For speaker prompts, we extract fine-grained prompt duration from the prompt speech and incorporate text and prompt speech via cross-attention in SMD. We demonstrate the effectiveness of MobileSpeech on multilingual datasets at different levels, achieving state-of-the-art results in terms of generation speed and speech quality. MobileSpeech achieves an RTF of 0.09 on a single A100 GPU, and we have successfully deployed MobileSpeech on mobile devices. Audio samples are available at https://mobilespeech.github.io/.
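For readers unfamiliar with the RTF metric, the following minimal sketch shows the usual definition: synthesis time divided by the duration of the generated audio. The function name and the timing numbers are hypothetical and chosen only to illustrate the arithmetic; they are not measurements from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical timings: 0.9 s of compute to produce 10 s of speech.
rtf = real_time_factor(synthesis_seconds=0.9, audio_seconds=10.0)
speedup = 1.0 / rtf  # how many times faster than real time

print(f"RTF = {rtf:.2f}, about {speedup:.1f}x faster than real time")
```

An RTF below 1 means audio is produced faster than it plays back. An RTF of 0.09 corresponds to roughly 11 seconds of speech per second of compute on the hardware where it was measured.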

Paper Structure

This paper contains 23 sections, 8 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: The overall architecture of MobileSpeech. In panel (a), the Duration Extractor module extracts prompt duration tokens from the prompt acoustic tokens, and SMD is the generative module for the target acoustic tokens. Panel (b) details the multi-channel training process used by the SMD module.
  • Figure 2: The process of obtaining target duration tokens from target text tokens and prompt speech tokens. The blue module is the Prompt Duration Extractor and the green module is the Duration Predictor.
  • Figure 3: Based on the RVQ structure, different works adopt distinct modeling approaches for discrete codecs: (a) SpearTTS, (b) VALL-E, (c) UniAudio, and (d) MobileSpeech.
  • Figure 4: MOS evaluation procedure
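The RVQ (residual vector quantization) structure referenced in Figure 3 can be sketched with a scalar toy example: each stage quantizes the residual left by the previous stage, so one frame yields one code per stage (the "channels" mentioned in the TL;DR). The codebooks below are hypothetical hand-picked values; real codecs use learned vector codebooks.

```python
# Hypothetical scalar codebooks, coarse to fine (real codecs learn vector codebooks).
CODEBOOKS = [
    [-1.0, 0.0, 1.0],
    [-0.25, 0.0, 0.25],
    [-0.05, 0.0, 0.05],
]


def rvq_encode(x, codebooks=CODEBOOKS):
    """Return one code index per stage; each stage quantizes the remaining residual."""
    codes = []
    residual = x
    for book in codebooks:
        idx = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        codes.append(idx)
        residual -= book[idx]
    return codes


def rvq_decode(codes, codebooks=CODEBOOKS):
    """Reconstruct the value by summing the selected entry from every stage."""
    return sum(book[idx] for book, idx in zip(codebooks, codes))


codes = rvq_encode(0.8)
approx = rvq_decode(codes)
print(codes, round(approx, 6))
```

The first code carries the coarse content and later codes add progressively finer corrections, which is why codec-based TTS systems must choose how to model the stages: sequentially, in parallel, or in some mixture of the two.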