SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training

Wenxi Chen, Ziyang Ma, Ruiqi Yan, Yuzhe Liang, Xiquan Li, Ruiyang Xu, Zhikang Niu, Yanqiao Zhu, Yifan Yang, Zhanxun Liu, Kai Yu, Yuxuan Hu, Jinyu Li, Yan Lu, Shujie Liu, Xie Chen

TL;DR

SLAM-Omni delivers a timbre-controllable, end-to-end spoken dialogue system trained in a single stage, introducing semantic-group modeling to shorten audio token sequences and historical text prompting to compress dialogue history. By decoupling speaker timbre into a vocoder and using a flow-matching vocoder pipeline, it achieves strong acoustic quality and speech-text alignment without ASR/TTS pre-training, while supporting multilingual and multi-turn interactions. Evaluations across English and Chinese benchmarks show competitive performance on understanding, reasoning, and conversation, with notable gains in UTMOS and WER metrics. The approach promises efficient, real-time voice interaction with zero-shot timbre control. The authors acknowledge two limitations: text-only history does not preserve rich non-verbal context from long conversations, and scaling to larger LLM backbones remains to be explored.

Abstract

Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.
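To make the sequence-length claim concrete, here is a minimal sketch of grouping a flat semantic-token sequence into fixed-size groups, so that the language model takes one step per group instead of one step per token. The function name `group_semantic_tokens`, the padding value `PAD_ID`, and the choice to pad the final group are illustrative assumptions and are not taken from the paper.

```python
from math import ceil

PAD_ID = -1  # hypothetical padding id, chosen only for this illustration


def group_semantic_tokens(tokens, group_size, pad_id=PAD_ID):
    """Split a flat token sequence into consecutive groups of `group_size`.

    The last group is right-padded with `pad_id` when the sequence length
    is not a multiple of `group_size` (an assumption of this sketch).
    """
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    padded = list(tokens)
    remainder = len(padded) % group_size
    if remainder:
        padded.extend([pad_id] * (group_size - remainder))
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


# Ten semantic tokens grouped with G = 3 need four autoregressive steps, not ten.
tokens = list(range(10))
groups = group_semantic_tokens(tokens, group_size=3)
steps_needed = ceil(len(tokens) / 3)
```

With a group size of G, a sequence of T tokens needs roughly T / G autoregressive steps, which is where the training and inference speed-up described in the abstract would come from.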

Paper Structure

This paper contains 39 sections, 5 equations, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Illustration of existing end-to-end spoken dialogue modeling. (a): Text-driven modeling. (b): Interleaved audio-text modeling. (c): Parallel audio-text modeling.
  • Figure 2: Overview of SLAM-Omni. System prompt, historical text prompt, followed by user speech embedding are concatenated as input for multi-turn voice interaction, while speaker prompt controls timbre using the vocoder. Semantic group modeling is used to accelerate speech token synthesis in the autoregressive language model.
  • Figure 3: Illustration of semantic group modeling with $G = 3$. At each step of the autoregressive process, embeddings of grouped semantic tokens and text tokens are aggregated as the input to the LLMs.
  • Figure 4: Illustration of the key-value cache mechanism in Historical Text Prompting for multi-round dialogue.
  • Figure 5: Training accuracy of the next text token prediction during ASR pre-training.
  • ...and 1 more figure
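The input layout described in the Figure 2 caption (system prompt, then historical text prompt, then the current user speech embedding) can be sketched as follows. This is a rough illustration of keeping earlier turns as text rather than audio tokens. The role tags, the newline-joined template, and the function name `build_history_prompt` are all hypothetical and are not taken from the paper.

```python
def build_history_prompt(system_prompt, history,
                         user_tag="<user>", assistant_tag="<assistant>"):
    """Assemble the text prefix for a multi-turn dialogue.

    `history` is a list of (user_text, assistant_text) pairs from earlier
    rounds. Only their text is kept, which is far shorter than the
    corresponding audio token sequences. The tags and layout here are
    assumptions made for illustration.
    """
    parts = [system_prompt]
    for user_text, assistant_text in history:
        parts.append(f"{user_tag} {user_text}")
        parts.append(f"{assistant_tag} {assistant_text}")
    return "\n".join(parts)


history = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("How large is it?", "Paris covers about 105 square kilometres."),
]
prefix = build_history_prompt("You are a helpful voice assistant.", history)
# In the full system, the embedding of the current user utterance would be
# concatenated after this text prefix; that step is not modeled here.
```

Because the prefix is plain text that only grows by appending, the key-value cache computed for earlier rounds could in principle be reused across turns, which is the idea Figure 4 illustrates.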