MOSS-TTS Technical Report

Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu

Abstract

This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
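The tokenizer figures quoted in the abstract (24 kHz audio, 12.5 frames per second, variable-bitrate RVQ) imply some simple arithmetic, sketched below. The sample rate and frame rate come from the abstract. The codebook size of 1024 and the layer counts are hypothetical values chosen for illustration, and the function names are not from the report.

```python
import math

SAMPLE_RATE_HZ = 24_000  # stated in the abstract
FRAME_RATE_FPS = 12.5    # stated in the abstract


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_rate_fps=FRAME_RATE_FPS):
    """Number of waveform samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_fps


def rvq_bitrate_bps(num_layers, codebook_size=1024, frame_rate_fps=FRAME_RATE_FPS):
    """Bitrate of an RVQ stream that keeps `num_layers` codebooks per frame.

    codebook_size=1024 is an assumption for illustration, not a value
    taken from the report.
    """
    bits_per_code = math.log2(codebook_size)
    return frame_rate_fps * num_layers * bits_per_code


if __name__ == "__main__":
    print(samples_per_frame())  # temporal downsampling factor
    for layers in (1, 8, 32):   # hypothetical layer counts
        print(layers, rvq_bitrate_bps(layers))
```

Under these assumptions, each frame covers 1920 samples, and dropping RVQ layers at inference time lowers the bitrate linearly, which is the sense in which a residual quantizer can be variable-bitrate.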

Paper Structure

This paper contains 50 sections, 16 equations, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Architecture of MOSS-Audio-Tokenizer. Both the encoder and decoder are built upon causal Transformers. All components, including the encoder, quantizer, decoder, decoder-only LLM, and discriminator, are optimized jointly in an end-to-end manner.
  • Figure 2: Architecture of MOSS-TTS. The left panel illustrates the delay pattern described in Section \ref{sec:delay_pattern}, while the right panel depicts the local transformer pattern detailed in Section \ref{sec:local_transformer_pattern}.
  • Figure 3: Overview of the MOSS-TTS pretraining data pipeline, including preprocessing, filtering, and targeted data synthesis.
  • Figure 4: Statistics of the MOSS-TTS pretraining corpus. Panel (a) shows the share of training hours by domain; panel (b) shows the language distribution as a donut chart (English/Chinese/Other) alongside a breakdown of the top minor languages; panel (c) shows the distribution of utterance duration by both hours (bars) and utterance count (line).
  • Figure 5: Comparison of objective reconstruction metrics between MOSS-Audio-Tokenizer and other state-of-the-art open-source audio tokenizers on the LibriSpeech test-clean dataset. Results are evaluated within the 0--4 kbps bitrate range. The horizontal axis represents the bitrate, and the vertical axis denotes the corresponding objective reconstruction scores.
  • ...and 1 more figures
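
The caption of Figure 2 refers to a delay pattern, but the section that defines it is not reproduced here. The following is therefore only a sketch of the generic delay pattern commonly used for multi-codebook token streams, in which codebook k is shifted right by k frames so that one autoregressive step can emit one token per codebook. Whether MOSS-TTS uses exactly this offset scheme is an assumption, and the function names and the padding value are hypothetical.

```python
def apply_delay_pattern(codes, pad=-1):
    """Shift row k of a [K][T] code grid right by k steps.

    `codes` is a list of K lists (one per codebook), each of length T.
    The result has K rows of length T + K - 1, padded with `pad`.
    """
    num_codebooks = len(codes)
    num_frames = len(codes[0])
    total_steps = num_frames + num_codebooks - 1
    delayed = []
    for k, row in enumerate(codes):
        left = [pad] * k
        right = [pad] * (total_steps - num_frames - k)
        delayed.append(left + list(row) + right)
    return delayed


def undo_delay_pattern(delayed):
    """Invert apply_delay_pattern, recovering the aligned [K][T] grid."""
    num_codebooks = len(delayed)
    num_frames = len(delayed[0]) - num_codebooks + 1
    return [row[k:k + num_frames] for k, row in enumerate(delayed)]


if __name__ == "__main__":
    grid = [[1, 2, 3], [4, 5, 6]]  # 2 codebooks, 3 frames
    shifted = apply_delay_pattern(grid)
    print(shifted)
    print(undo_delay_pattern(shifted) == grid)
```

In this layout, column t of the shifted grid holds the token of codebook k for frame t - k, so finer codebooks are predicted after the coarser ones for the same frame.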