Investigating the Effects of Large-Scale Pseudo-Stereo Data and Different Speech Foundation Model on Dialogue Generative Spoken Language Model

Yu-Kuan Fu, Cheng-Kuang Lee, Hsiu-Hsuan Wang, Hung-yi Lee

TL;DR

This work tackles the scarcity of stereo dialogue data by introducing a pipeline to convert single-channel dialogues into pseudo-stereo data, expanding the training corpus to about 17.6k hours and enabling end-to-end, speech-driven dialogue modeling. It augments the dGSLM framework with diverse pseudo-stereo data and systematically evaluates multiple discrete-unit encoders, finding that an ASR-finetuned HuBERT (HuBERT large ft) yields the best semantic coherence when paired with pseudo-stereo data. The study reports improvements in turn-taking realism and semantic quality, while noting vocoder limitations that affect audio naturalness and the reliability of some encoders. The contributions include a scalable data-generation pipeline, empirical guidance on encoder choices for speech-to-units dialogue modeling, and an open-source pseudo-stereo dataset to accelerate research in speech-based dialogue systems.

Abstract

Recent efforts in spoken dialogue modeling aim to synthesize spoken dialogue without relying on transcription, thereby preserving the non-textual information inherent in speech. However, this approach faces a challenge when speakers talk simultaneously: it requires stereo dialogue data with each speaker recorded on a separate channel, which is a scarce resource. To address this, we developed a pipeline that transforms single-channel dialogue data into pseudo-stereo data. This expanded our training dataset from 2,000 to 17,600 hours, enriching the diversity and quality of the available training examples. Including this pseudo-stereo data proved effective in improving the performance of spoken dialogue language models. Additionally, we explored using discrete units from different speech foundation models for spoken dialogue generation.

Paper Structure

This paper contains 21 sections, 4 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: The pipeline of generating pseudo-stereo data from single-channel dialogue data. We split the process into 3 steps: speaker diarization, source separation, and speaker verification.
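
The three steps named in the Figure 1 caption can be sketched as follows. This is a minimal illustration of one plausible reading of the caption, not the authors' implementation: it assumes exactly two speakers labelled "A" and "B", that diarization has already produced single-speaker turns and overlap regions as sample ranges, and that `separate`, `embed`, and `ref_a` are hypothetical stand-ins for a source-separation model, a speaker-embedding model, and a reference embedding for speaker A.

```python
import math
from typing import Callable, List, Sequence, Tuple

Audio = List[float]
Turn = Tuple[int, int, str]   # (start_sample, end_sample, speaker label "A" or "B")
Span = Tuple[int, int]        # (start_sample, end_sample) of overlapped speech


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def to_pseudo_stereo(
    mono: Audio,
    turns: Sequence[Turn],
    overlaps: Sequence[Span],
    separate: Callable[[Audio], Tuple[Audio, Audio]],  # hypothetical separator
    embed: Callable[[Audio], Sequence[float]],         # hypothetical speaker embedder
    ref_a: Sequence[float],                            # hypothetical reference for speaker A
) -> Tuple[Audio, Audio]:
    """Return (left, right) channels: speaker A on the left, speaker B on the right."""
    left = [0.0] * len(mono)
    right = [0.0] * len(mono)

    # Step 1 (speaker diarization output): single-speaker turns are copied
    # unchanged into the channel of the speaker they were attributed to.
    for start, end, speaker in turns:
        target = left if speaker == "A" else right
        target[start:end] = mono[start:end]

    for start, end in overlaps:
        # Step 2 (source separation): split the overlapped region into two
        # sources. The separator is assumed to preserve the chunk length.
        src_1, src_2 = separate(mono[start:end])

        # Step 3 (speaker verification): the source whose embedding is closer
        # to speaker A's reference goes left; the other source goes right.
        if cosine(embed(src_1), ref_a) >= cosine(embed(src_2), ref_a):
            left[start:end], right[start:end] = src_1, src_2
        else:
            left[start:end], right[start:end] = src_2, src_1

    return left, right
```

In practice each placeholder would be backed by a trained model, and the silent regions of each channel stay at zero so that the two channels remain time-aligned with the original recording.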