SALMONN-omni: A Standalone Speech LLM without Codec Injection for Full-duplex Conversation

Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang

TL;DR

SALMONN-omni tackles the challenge of true standalone full-duplex speech interaction without audio codec injection. It combines a streaming Mamba encoder, a unified LLM backbone, and a CosyVoice2-based streaming synthesizer, governed by an explicit 'thinking' mechanism to predict dialogue state transitions. A three-stage training pipeline, including reinforcement learning with Direct Preference Optimization, yields state-of-the-art turn-taking and robust handling of backchanneling, barge-ins, and echo cancellation, while remaining competitive with larger half-duplex models. The approach reduces system complexity by eliminating external modules and demonstrates practical impact for natural, real-time human-AI speech conversations. Overall, SALMONN-omni advances end-to-end full-duplex speech LLMs with improved timing, stability, and conversational dynamics in diverse scenarios.
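
For reference, the reinforcement learning stage mentioned above uses Direct Preference Optimization. The standard DPO objective is sketched below. The symbols are the usual ones from the DPO literature and are assumptions here, since this summary does not state them: $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ a frozen reference model, $x$ the context, $y_w$ and $y_l$ the preferred and dispreferred responses, $\beta$ a temperature, and $\sigma$ the logistic function. How SALMONN-omni builds its preference pairs is not described in this section.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```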

Abstract

To enable fluid and natural human-machine speech interaction, existing full-duplex conversational systems often adopt modular architectures with auxiliary components such as voice activity detectors, interrupters, conversation state predictors, or multiple LLMs. These systems, however, suffer from error accumulation across modules and struggle with key challenges such as context-dependent barge-in and echo cancellation. Recent approaches, most notably Moshi, simplify the pipeline by injecting audio codecs into the token space of a single LLM. However, such methods still incur significant performance degradation when operating on the speech modality rather than the text modality. In this paper, we introduce SALMONN-omni, the first single, standalone full-duplex speech LLM that operates without audio codecs in its token space. It features a novel dynamic thinking mechanism within the LLM backbone, enabling the model to learn when to transition between speaking and listening states. Experiments on widely used benchmarks for spoken question answering and open-domain dialogue show that SALMONN-omni achieves at least a 30% relative performance improvement over existing open-source full-duplex models and is highly competitive with half-duplex and turn-based systems, despite using substantially less training data. Moreover, SALMONN-omni demonstrates strong performance in complex conversational scenarios, including turn-taking, backchanneling, echo cancellation, and context-dependent barge-in, with further improvements achieved through reinforcement learning. Demo conversations between a user and SALMONN-omni are available in the following repository: https://github.com/bytedance/SALMONN.
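
The idea of a backbone that learns when to switch between listening and speaking can be illustrated with a small state machine. This is a minimal sketch under stated assumptions, not the authors' implementation: the control-token names `<think>` and `<shift>` and the helpers `step` and `run` are hypothetical, and the token stream is hand-written instead of being predicted by a model.

```python
from enum import Enum


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


# Hypothetical control tokens; the names are illustrative only.
THINK = "<think>"   # "no state change at this step"
SHIFT = "<shift>"   # "toggle between listening and speaking"


def step(state, token):
    """Consume one predicted token; return (next_state, text_to_synthesize)."""
    if token == SHIFT:
        nxt = State.SPEAKING if state is State.LISTENING else State.LISTENING
        return nxt, None
    if token == THINK:
        return state, None
    # Ordinary text tokens are forwarded to the synthesizer only while speaking.
    if state is State.SPEAKING:
        return state, token
    return state, None


def run(tokens, state=State.LISTENING):
    """Replay a token stream and collect the text that would be spoken."""
    spoken = []
    for tok in tokens:
        state, text = step(state, tok)
        if text is not None:
            spoken.append(text)
    return state, spoken


if __name__ == "__main__":
    stream = [THINK, THINK, SHIFT, "Hello", "there", SHIFT, THINK]
    final_state, spoken = run(stream)
    print(final_state.value, spoken)
```

In this sketch a barge-in would appear as a `<shift>` token arriving while the state is `SPEAKING`, which returns the loop to `LISTENING` before the response has finished.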

Paper Structure

This paper contains 38 sections, 8 figures, 13 tables.

Figures (8)

  • Figure 1: The architecture of SALMONN-omni. Two input streams, the environment stream and the assistant stream, are processed by the streaming speech encoder. Speech embeddings from both streams, along with textual embeddings, are fed into the LLM backbone in an interleaved manner. When in the speaking state, the streaming speech synthesizer takes the textual embeddings derived from the LLM backbone as input to produce speech responses.
  • Figure 2: Illustration of the "Implicit" and "Explicit" thinking strategies. The tokens at the top of the LLM are predicted from speech embeddings, while those at the bottom are predicted from textual embeddings and fed back into the LLM's input sequence.
  • Figure 3: Three-stage training strategy for SALMONN-omni.
  • Figure 4: The overall F1 score of SALMONN-omni when trained with different batch sizes during the DPO stage.
  • Figure 5: Spoken QA: SALMONN-omni can handle turn-taking in spoken question answering scenarios with the "thinking" mechanism.
  • ...and 3 more figures