SALMONN-omni: A Codec-free LLM for Full-duplex Speech Understanding and Generation

Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, Chao Zhang

TL;DR

SALMONN-omni addresses the need for true full-duplex speech interaction in an end-to-end, codec-free LLM. It fuses a streaming speech encoder, an LLM, and a streaming speech synthesizer with time-block synchronization and a novel "thinking" mechanism to manage listening and speaking states. The work introduces two special tokens and a loss formulation to train the model end-to-end without discrete speech codecs, enabling turn-taking, barge-in, and echo cancellation in real time. It demonstrates versatility across streaming tasks (recognition, enhancement, dereverberation, target speaker extraction, spoken QA) using large-scale LibriHeavy and GigaSpeech data and synthetic scenarios, positioning it as a strong prototype for future conversational AI.

Abstract

Full-duplex multimodal large language models (LLMs) provide a unified framework for addressing diverse speech understanding and generation tasks, enabling more natural and seamless human-machine conversations. Unlike traditional modularised conversational AI systems, which separate speech recognition, understanding, and text-to-speech generation into distinct components, multimodal LLMs operate as single end-to-end models. This streamlined design eliminates error propagation across components and fully leverages the rich non-verbal information embedded in input speech signals. We introduce SALMONN-omni, a codec-free, full-duplex speech understanding and generation model capable of simultaneously listening to its own generated speech and background sounds while speaking. To support this capability, we propose a novel duplex spoken dialogue framework incorporating a "thinking" mechanism that facilitates asynchronous text and speech generation relying on embeddings instead of codecs (quantized speech and audio tokens). Experimental results demonstrate SALMONN-omni's versatility across a broad range of streaming speech tasks, including speech recognition, speech enhancement, and spoken question answering. Additionally, SALMONN-omni excels at managing turn-taking, barge-in, and echo cancellation scenarios, establishing its potential as a robust prototype for full-duplex conversational AI systems. To the best of our knowledge, SALMONN-omni is the first codec-free model of its kind. A full technical report along with model checkpoints will be released soon.
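
To make the idea of switching between listening and speaking states over time blocks more concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes that each time block yields one decision and that two special tokens trigger the state changes. The token names `<start_speak>` and `<stop_speak>`, the string block labels, and the rule-based `toy_decide` function are all hypothetical stand-ins for what the LLM would learn; the real model operates on embeddings rather than labels.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


# Hypothetical names: the source mentions two special tokens but does not name them.
START_SPEAK = "<start_speak>"
STOP_SPEAK = "<stop_speak>"


@dataclass
class DuplexController:
    """Toy state machine advanced once per time block."""

    # decide(state, block) returns a special token or None (no state change).
    decide: Callable[[State, str], Optional[str]]
    state: State = State.LISTENING
    trace: List[State] = field(default_factory=list)

    def step(self, block: str) -> State:
        token = self.decide(self.state, block)
        if self.state is State.LISTENING and token == START_SPEAK:
            self.state = State.SPEAKING
        elif self.state is State.SPEAKING and token == STOP_SPEAK:
            self.state = State.LISTENING
        self.trace.append(self.state)
        return self.state


def toy_decide(state: State, block: str) -> Optional[str]:
    """Hand-written stand-in for the model's learned decision (an assumption)."""
    # Turn-taking: silence while listening is treated as the end of the user's turn.
    if state is State.LISTENING and block == "silence":
        return START_SPEAK
    # Barge-in: user speech heard while speaking stops the response.
    # A block containing only the model's own echo does not trigger a stop.
    if state is State.SPEAKING and "user" in block:
        return STOP_SPEAK
    return None


def run(blocks: List[str]) -> List[str]:
    """Feed one label per time block and return the resulting state names."""
    controller = DuplexController(decide=toy_decide)
    return [controller.step(block).value for block in blocks]


if __name__ == "__main__":
    blocks = ["user", "user", "silence", "echo", "echo", "user+echo", "user"]
    for block, state in zip(blocks, run(blocks)):
        print(f"{block:10s} -> {state}")
```

In this sketch the controller keeps speaking through the `echo` blocks, which loosely mirrors the echo-cancellation scenario, and returns to listening on `user+echo`, which mirrors barge-in. How SALMONN-omni actually makes these decisions is learned end-to-end and is not specified in the text above.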

Paper Structure

This paper contains 5 sections, 1 equation, 9 figures.

Figures (9)

  • Figure 1: SALMONN-omni is a codec-free, full-duplex end-to-end conversational AI model capable of managing dynamic dialogue interactions such as turn-taking and context-dependent barge-in in human-machine conversations.
  • Figure 2: SALMONN-omni is an end-to-end model that integrates a streaming speech encoder, an LLM, and a streaming speech synthesizer, all interconnected through embeddings. It features a novel codec-free, full-duplex spoken dialogue framework enhanced by a "thinking" mechanism.
  • Figure 3: Streaming speech recognition: Without using the speech synthesizer, SALMONN-omni can convert the input speech to text in a streaming way.
  • Figure 4: Speech enhancement: SALMONN-omni improves speech quality by denoising and dereverberation. It listens to the noisy speech and then re-speaks the speech content with enhanced clarity.
  • Figure 5: Spoken question answering: SALMONN-omni can handle turn-taking in spoken question-answering scenarios.
  • ...and 4 more figures