Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
Yinghao Aaron Li, Xilin Jiang, Jordan Darefsky, Ge Zhu, Nima Mesgarani
TL;DR
Style-Talker is a hybrid spoken dialogue system that fine-tunes an audio LLM and a style-based TTS model to generate both the response text and its speaking style directly from the input speech and the conversation history. Because the audio LLM consumes the input audio together with the prior turns, no ASR step is needed before the LLM, which substantially reduces latency while preserving paralinguistic cues. Evaluations on DailyTalk and PodcastFillers show that Style-Talker outperforms cascade and end-to-end baselines in naturalness and coherence, with real-time performance improvements of approximately 2x to over 4x, enabling more realistic and responsive spoken dialogue even on in-the-wild data. The approach advances practical, high-quality speech-to-speech dialogue with minimal labeling needs and broad applicability to real-world data.
Abstract
The rapid advancement of large language models (LLMs) has significantly propelled the development of text-based chatbots, demonstrating their capability to engage in coherent and contextually relevant dialogues. However, extending these advancements to end-to-end speech-to-speech conversation bots remains a formidable challenge, primarily because of the extensive data and computational resources required. The conventional approach of cascading automatic speech recognition (ASR), LLM, and text-to-speech (TTS) models in a pipeline, while effective, suffers from unnatural prosody because there is no direct interaction among the input audio, its transcribed text, and the output audio. In real-time applications, these systems are also limited by the inherent latency of the ASR step. This paper introduces Style-Talker, an innovative framework that fine-tunes an audio LLM alongside a style-based TTS model for fast spoken dialogue generation. Style-Talker takes the user's input audio, together with the transcribed chat history and speech styles of previous turns, and generates both the speaking style and the text of the response. The TTS model then synthesizes the speech, which is played back to the user. While the response speech is being played, the input speech undergoes ASR processing to extract its transcription and speaking style, which serve as context for the ensuing dialogue turn. This novel pipeline is faster than traditional cascaded ASR-LLM-TTS systems while integrating rich paralinguistic information from the input speech. Our experimental results show that Style-Talker significantly outperforms the conventional cascade and speech-to-speech baselines in terms of both dialogue naturalness and coherence while being more than 50% faster.
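The turn-level control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function and class name below (audio_llm, tts, play, asr_and_style, Turn, dialogue_turn) is a hypothetical stand-in, the models are replaced by stubs, and the sleep durations are arbitrary. Storing the generated text and style as the system's turn in the history is also an assumption. The only point shown is the ordering: the response is generated and synthesized first, and ASR on the user's audio runs in the background while the response is played.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    text: str
    style: list  # placeholder for a speaking-style vector


# --- Hypothetical stubs standing in for the real models ---

def audio_llm(input_audio, history):
    """Stub audio LLM: maps input audio + history to (response text, response style)."""
    return f"reply to turn {len(history)}", [0.0]


def tts(text, style):
    """Stub style-based TTS: returns a token representing synthesized speech."""
    return f"<speech:{text}>"


def play(speech, seconds=0.05):
    """Stub playback: blocks for the (arbitrary) duration of the response audio."""
    time.sleep(seconds)


def asr_and_style(input_audio, seconds=0.02):
    """Stub ASR + style extraction for the user's audio."""
    time.sleep(seconds)
    return f"transcript of {input_audio}", [1.0]


def dialogue_turn(input_audio, history, pool):
    # 1. Generate response text and style directly from audio; no ASR beforehand.
    text, style = audio_llm(input_audio, history)
    # 2. Synthesize the response.
    speech = tts(text, style)
    # 3. Start ASR on the user's audio in the background, then play the response.
    pending = pool.submit(asr_and_style, input_audio)
    play(speech)
    # 4. Collect the transcript and style as context for the next turn.
    user_text, user_style = pending.result()
    history.append(Turn("user", user_text, user_style))
    history.append(Turn("system", text, style))  # assumption: generated text/style are stored
    return speech
```

A caller would create one ThreadPoolExecutor, keep a single history list, and call dialogue_turn once per user utterance. In this sketch the ASR cost affects response latency only if it takes longer than playback, which is the source of the speedup over a cascade that runs ASR before the LLM.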
