Building a Taiwanese Mandarin Spoken Language Model: A First Attempt

Chih-Kai Yang, Yu-Kuan Fu, Chen-An Li, Yi-Cheng Lin, Yu-Xiang Lin, Wei-Chih Chen, Ho Lam Chung, Chun-Yi Kuan, Wei-Ping Huang, Ke-Han Lu, Tzu-Quan Lin, Hsiu-Hsuan Wang, En-Pei Hu, Chan-Jan Hsu, Liang-Hsuan Tseng, I-Hsiang Chiu, Ulin Sanga, Xuanjun Chen, Po-chun Hsu, Shu-wen Yang, Hung-yi Lee

TL;DR

This paper reports an initial attempt to build a Taiwanese Mandarin spoken LLM with real-time speech-to-speech capabilities. It uses an end-to-end decoder-only transformer initialized from a text LLM, integrating streaming ASR, a speech encoder, and a diffusion-based speech decoder to enable full-duplex dialogue. The training combines extensive pre-training and supervised fine-tuning on a mix of real and synthetic dialogues, with a modality-control prompt scheme enabling text, unit, or hybrid inputs and outputs. An evaluation platform, including multi-turn dialogue assessments and a Forum of Spoken Agents, measures coherence, intelligibility, and speech quality to guide open-source development in Taiwanese Mandarin spoken LLMs.

Abstract

This technical report presents our initial attempt to build a spoken large language model (LLM) for Taiwanese Mandarin, specifically tailored to enable real-time, speech-to-speech interaction in multi-turn conversations. Our end-to-end model incorporates a decoder-only transformer architecture and aims to achieve seamless interaction while preserving the conversational flow, including full-duplex capabilities allowing simultaneous speaking and listening. The paper also details the training process, including data preparation with synthesized dialogues and adjustments for real-time interaction. We also developed a platform to evaluate conversational fluency and response coherence in multi-turn dialogues. We hope the release of the report can contribute to the future development of spoken LLMs in Taiwanese Mandarin.

Paper Structure

This paper contains 47 sections, 1 equation, 9 figures, 3 tables.

Figures (9)

  • Figure 1: An overview of this project's spoken LLM network architecture.
  • Figure 2: Overview of different interaction patterns and our proposed end-of-turn detection method.
  • Figure 3: An illustration of the textual data generation pipeline. In Stage 3, the sentence in bold represents the inserted text, creating the simultaneous speech content. To save space, we simplified the prompt in the figure. The blue text is the English translation of the corresponding Mandarin text.
  • Figure 4: An illustration of how we represent speaker overlap or interruption in the model's input sequences.
  • Figure 5: An illustration of how we extract HuBERT units in a streaming fashion during real-time inference.
  • ...and 4 more figures