Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation

Yinghao Aaron Li; Xilin Jiang; Jordan Darefsky; Ge Zhu; Nima Mesgarani

TL;DR

Style-Talker is a spoken dialogue system that fine-tunes an audio LLM together with a style-based TTS model, so that both the response text and its speaking style are generated directly from the input speech and the conversation history. Because the audio LLM consumes the incoming audio itself, ASR no longer has to finish before response generation starts, which reduces latency while preserving paralinguistic cues. Evaluations on DailyTalk and PodcastFillers show that Style-Talker outperforms cascade and end-to-end baselines in naturalness and coherence, with real-time performance improvements of approximately 2x to over 4x, including on in-the-wild recordings. The approach needs little labeling and applies broadly to real-world data.

Abstract

The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and contextually relevant dialogues. However, extending these advancements to end-to-end speech-to-speech conversation bots remains a formidable challenge, primarily due to the extensive datasets and computational resources required. The conventional approach of cascading automatic speech recognition (ASR), LLM, and text-to-speech (TTS) models in a pipeline, while effective, suffers from unnatural prosody because there is no direct interaction between the input audio, its transcribed text, and the output audio. In real-time applications, these systems are also limited by the latency inherent in the ASR step. This paper introduces Style-Talker, a framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialogue generation. Style-Talker takes the user's input audio and uses the transcribed chat history and speech styles to generate both the speaking style and the text of the response. The TTS model then synthesizes the speech, which is played back to the user. While the response speech is being played, the input speech undergoes ASR processing to extract its transcription and speaking style, which serve as context for the next dialogue turn. This pipeline accelerates the traditional cascaded ASR-LLM-TTS system while integrating rich paralinguistic information from the input speech. Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence while being more than 50% faster.
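The turn loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `audio_llm`, `tts`, `asr_and_style`, `play`, `Turn`, and `DialogueState` are hypothetical stand-ins that return dummy values, and a thread pool is assumed as one simple way to overlap ASR with playback.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import List, Tuple

Style = Tuple[float, ...]  # placeholder for a speaking-style vector


@dataclass
class Turn:
    speaker: str
    text: str
    style: Style


@dataclass
class DialogueState:
    history: List[Turn] = field(default_factory=list)


# --- Hypothetical stubs; real models would replace these. ---
def audio_llm(audio: bytes, history: List[Turn]) -> Tuple[str, Style]:
    """Stub: response text and style from raw input audio plus context."""
    return f"reply to turn {len(history)}", (0.1, 0.2)


def tts(text: str, style: Style) -> bytes:
    """Stub: synthesize speech conditioned on text and style."""
    return text.encode("utf-8")


def asr_and_style(audio: bytes) -> Tuple[str, Style]:
    """Stub: transcription and style of the user's audio."""
    return audio.decode("utf-8"), (0.3, 0.4)


def play(audio: bytes) -> None:
    """Stub: audio playback (a no-op here)."""


def respond(state: DialogueState, user_audio: bytes,
            pool: ThreadPoolExecutor) -> bytes:
    # 1. The audio LLM answers from the raw audio; no ASR on this path.
    text, style = audio_llm(user_audio, state.history)
    # 2. The TTS model synthesizes the response speech.
    speech = tts(text, style)
    # 3. ASR on the user's audio runs in the background during playback.
    pending = pool.submit(asr_and_style, user_audio)
    play(speech)
    user_text, user_style = pending.result()
    # 4. Both turns become context for the next round.
    state.history.append(Turn("user", user_text, user_style))
    state.history.append(Turn("system", text, style))
    return speech


if __name__ == "__main__":
    state = DialogueState()
    with ThreadPoolExecutor(max_workers=1) as pool:
        respond(state, b"hello there", pool)
    print([turn.text for turn in state.history])
```

The point of the sketch is the ordering: the user's transcript is only needed as context for the following turn, so producing it can be deferred until the response is already playing.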

Paper Structure (27 sections, 3 equations, 2 figures, 7 tables)

Introduction
Related Works
Text-Based Approaches and Paralinguistic Decoupling
End-to-End Speech-to-Speech Generation
Methods
Style-Talker
StyleTTS 2
Qwen-Audio
Conversation Context
Training Objectives
Experiments
Datasets
Evaluations
Subjective Evaluations
Objective Evaluations
...and 12 more sections

Figures (2)

Figure 1: An overview of SDS with a comparison between the conventional cascaded system and Style-Talker. The cascaded system has three steps: input speech transcription (ASR), response text generation (LLM), and response speech synthesis (TTS). Style-Talker adopts an audio LLM to merge the first two steps and generate a response text and corresponding style directly, preserving the content prosody while achieving generation efficiency.
Figure 2: Model components and processing pipelines of Style-Talker for response generation. Audio $\bm{x}_n$ from the incoming speaker, a reference and past speaker styles in the conversation, and transcriptions from previous rounds ($\bm{c}_n$) are all embedded into the same space by the audio encoder, the style projection $\mathcal{P}_\text{in}$, and the text tokenizer and embedder (not shown), respectively. They are jointly processed by the LLM (QwenLM) to generate the response text $\bm{t}_{n+1}$ and style $\bm{s}_{n+1}$, from which StyleTTS 2 synthesizes the response speech $\bm{x}_{n+1}$.
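The latency argument behind the Figure 1 comparison can be written as simple arithmetic. The sketch below is only a back-of-the-envelope model under stated assumptions: the stage names in `StageTimes` and the timings in `example` are hypothetical placeholders, not measurements from the paper, and ASR is assumed to finish within the playback time so that it adds nothing to the response delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageTimes:
    """Per-turn stage durations in seconds (placeholders, not measurements)."""
    asr: float        # speech recognition on the user's input
    text_llm: float   # text-only LLM in the cascaded system
    audio_llm: float  # audio LLM that reads the input speech directly
    tts: float        # speech synthesis


def cascade_latency(t: StageTimes) -> float:
    # ASR, LLM, and TTS run back to back before the user hears anything.
    return t.asr + t.text_llm + t.tts


def merged_latency(t: StageTimes) -> float:
    # The audio LLM replaces ASR + text LLM on the response path;
    # ASR is assumed to be hidden behind playback.
    return t.audio_llm + t.tts


def speedup(t: StageTimes) -> float:
    return cascade_latency(t) / merged_latency(t)


# Made-up timings, chosen only to show the calculation.
example = StageTimes(asr=0.4, text_llm=0.5, audio_llm=0.6, tts=0.3)

if __name__ == "__main__":
    print(f"cascade: {cascade_latency(example):.2f} s")
    print(f"merged:  {merged_latency(example):.2f} s")
    print(f"speedup: {speedup(example):.2f}x")
```

Under this model the merged system is faster whenever the audio LLM takes less time than ASR and the text LLM combined. If the response is shorter than the ASR run, the overlap assumption no longer holds and the next turn would have to wait for the transcript.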
