MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

Xingjian Zhao, Zhe Xu, Qinyuan Cheng, Zhaoye Fei, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Qinghui Gao, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu

TL;DR

Cascaded spoken dialogue systems that rely on text intermediates lose paralinguistic cues and incur latency. MOSS-Speech introduces a true speech-to-speech LLM that uses a modality-based layer split and frozen pre-training to transplant linguistic knowledge from a pretrained text LLM into native speech understanding and generation. The approach employs two-stage pre-training on large-scale speech data with alignment to the text backbone, a semantic speech tokenizer with a streaming-capable encoder and decoder, and synthetic supervised fine-tuning data to preserve text abilities while learning speech. It achieves state-of-the-art results on spoken question answering and competitive speech-to-speech performance relative to text-guided systems, demonstrating a practical, end-to-end expressive dialogue paradigm that narrows the gap between spoken and written interaction.

Abstract

Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.

Paper Structure

This paper contains 40 sections, 7 equations, 32 figures, 7 tables.

Figures (32)

  • Figure 1: Paradigms for spoken dialogue modeling. (a) Cascaded pipelines rely on ASR → LLM → TTS, discarding paralinguistic cues. (b) Text-guided speech models incorporate speech input but still depend on text as an intermediate during generation. (c) True speech-to-speech language models directly comprehend and produce speech, avoiding the text bottleneck.
  • Figure 2: Visualization of the layer-wise similarity between speech and text representations. (a)–(d) Cosine similarity heatmaps at representative layers (0, 10, 24 and 27) reveal how cross-modal alignment evolves across the model depth. The yellow dots are the points selected by DTW sampling based on similarity; they largely coincide with the regions of high similarity. The complete cosine similarity heatmaps for the embedding and all 28 layers are given in the appendix. (e) The similarity score across all layers on five samples increases progressively up to around layer 10, fluctuates slightly over the subsequent 14 layers, and then declines noticeably in the final layers. This trend indicates that speech and text representations become gradually fused in the lower-to-middle layers but diverge again at the top layers. The content of the five samples is provided in the appendix.
  • Figure 3: Model architecture and training strategy. We split the trailing Transformer layers based on modality, and freeze the text backbone during Stage I pre-training. Both branches are initialized from the same pretrained text model backbone.
  • Figure 4: Embedding
  • Figure 5: Layer 0
  • ...and 27 more figures
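
The similarity analysis described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names are hypothetical, the inputs are assumed to be per-frame speech hidden states and per-token text hidden states taken from the same layer, and the per-layer score is assumed to be the mean cosine similarity along a standard DTW path (the caption does not state the exact scoring rule).

```python
import math


def cosine_similarity_matrix(speech, text):
    """Return S where S[i][j] is the cosine similarity of speech[i] and text[j]."""
    def norm(v):
        # Fall back to 1.0 for an all-zero vector to avoid division by zero.
        return math.sqrt(sum(x * x for x in v)) or 1.0

    return [
        [sum(a * b for a, b in zip(s, t)) / (norm(s) * norm(t)) for t in text]
        for s in speech
    ]


def dtw_path(sim):
    """Monotonic alignment path from (0, 0) to the last cell.

    Standard DTW with local cost 1 - similarity, so the path prefers
    cells of high similarity (assumed scoring, see the lead-in).
    """
    n, m = len(sim), len(sim[0])
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = (1.0 - sim[i - 1][j - 1]) + best_prev

    # Backtrack from the last cell to the first one.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    return path[::-1]


def alignment_score(sim, path):
    """Mean similarity over the cells selected by the DTW path."""
    return sum(sim[i][j] for i, j in path) / len(path)


# Toy example: three speech frames, two text tokens.
speech_states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
text_states = [[1.0, 0.0], [0.0, 1.0]]
sim = cosine_similarity_matrix(speech_states, text_states)
path = dtw_path(sim)
print(path, alignment_score(sim, path))
```

Computing this score once per layer would give a curve comparable in spirit to panel (e); the actual hidden-state extraction and any normalization used in the paper are not specified here.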