Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models

Kyudan Jung, Jihwan Kim, Soyoon Kim, Jeonghoon Kim, Jaegul Choo, Cheonbok Park

Abstract

As the paradigm of AI shifts from text-based LLMs to Speech Language Models (SLMs), there is a growing demand for full-duplex systems capable of real-time, natural human-computer interaction. However, the development of such models is constrained by the scarcity of high-quality, multi-speaker conversational data, as existing large-scale resources are predominantly single-speaker or limited in volume. Addressing the complex dynamics of natural dialogue, such as overlapping speech and back-channeling, remains a challenge, with standard processing pipelines suffering from diarization errors and ASR hallucinations. To bridge this gap, we present a robust and scalable open-source data processing pipeline designed for full-duplex models.

Paper Structure

This paper contains 33 sections, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: The overall pipeline of the Sommelier conversational audio pre-processing. Blue boxes denote neural model-based components, and a green box represents an algorithmic component.
  • Figure 2: Illustration of the speech overlap separation process. (a) The process of calculating similarity to distinguish speaker identities using arbitrary independent speaker segments. (b) Separating overlapped regions and making identity decisions for candidates based on the similarity calculated in (a). Finally, the separated segments are concatenated with the original segments.
  • Figure 3: WER comparison by method, SIR, and overlap ratio for both speakers. Top: WER as a function of SIR (dB). Bottom: WER as a function of overlap ratio. Methods include Baseline (mixed), Separation, and Oracle. Error bars represent standard deviation.
  • Figure 4: Four ways to handle backchanneling in overlapping speech.
  • Figure 5: Four distinct types of separable cases in overlapping speech.
  • ...and 2 more figures
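The identity decision described in Figure 2 — comparing separated overlap candidates against arbitrary independent speaker segments by similarity — can be illustrated with a minimal sketch. This is a hypothetical example, not the paper's implementation: the function names and the toy 4-dimensional vectors are assumptions, standing in for real speaker-encoder embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def assign_speaker(candidate_emb, reference_embs):
    """Assign a separated overlap candidate to the reference speaker
    whose independent-segment embedding is most similar.
    (Illustrative stand-in for the identity decision in Figure 2.)"""
    sims = {spk: cosine_similarity(candidate_emb, emb)
            for spk, emb in reference_embs.items()}
    best = max(sims, key=sims.get)
    return best, sims

# Toy 4-dim "embeddings"; a real system would use speaker-encoder vectors
# extracted from non-overlapping segments of each speaker.
refs = {"A": [1.0, 0.0, 0.0, 0.0],
        "B": [0.0, 1.0, 0.0, 0.0]}
candidate = [0.9, 0.1, 0.0, 0.0]  # separated overlap region, closer to A
speaker, sims = assign_speaker(candidate, refs)
```

Once each separated candidate is labeled this way, it can be concatenated back with that speaker's original segments, as the caption describes.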