Advancing Large Language Models to Capture Varied Speaking Styles and Respond Properly in Spoken Conversations

Guan-Ting Lin, Cheng-Han Chiang, Hung-yi Lee

TL;DR

This work tackles the challenge of making large language models respond differently to the same spoken content when the speaking style varies. It introduces StyleTalk, a speech-to-speech dataset that pairs identical context and content with multiple speaking styles and corresponding expressive responses, and presents Spoken-LLM, a two-stage multimodal training framework that aligns current speech style and predicts a styled spoken reply. Through objective and subjective evaluations, Spoken-LLM outperforms text-only baselines and prior speech-LLM approaches on both lexical/semantic accuracy and style fidelity, with the chunk-based style embeddings yielding the strongest results. The StyleTalk dataset is released to the community to advance research in style-aware spoken dialogue systems and multimodal LLMs.

Abstract

In spoken dialogue, even if two current turns are the same sentence, their responses might still differ when they are spoken in different styles. The spoken styles, containing paralinguistic and prosodic information, mark the most significant difference between the text and speech modalities. When text-only LLMs are used to model spoken dialogue, they cannot give different responses based on the speaking style of the current turn. In this paper, we focus on enabling LLMs to listen to the speaking styles and respond properly. Our goal is to teach the LLM that "even if the sentences are identical, if they are spoken in different styles, their corresponding responses might be different". Since there is no suitable dataset for achieving this goal, we collect a speech-to-speech dataset, StyleTalk, with the following desired characteristic: when two current speeches have the same content but are spoken in different styles, their responses will be different. To teach LLMs to understand and respond properly to the speaking styles, we propose the Spoken-LLM framework, which can model both the linguistic content and the speaking styles. We train Spoken-LLM using the StyleTalk dataset and devise a two-stage training pipeline to help Spoken-LLM better learn the speaking styles. Based on extensive experiments, we show that Spoken-LLM outperforms text-only baselines and prior speech LLM methods.

Paper Structure (34 sections, 2 equations, 8 figures, 7 tables)

Figures (8)

  • Figure 1: Overview of the Spoken-LLM framework. (c1,r1) and (c2,r2) are the current and response speech sample pairs. c1 and c2 are fed into the model individually.
  • Figure 2: Data collection pipeline of StyleTalk. The details of instruction and prompt template are in the Appendix.
  • Figure 3: Human evaluation results comparing Spoken-LLM-chunk with Text-LLM (text-only) and Text-LLM (cascaded).
  • Figure 4: The output emotion distribution given input emotion. Each row is the probability distribution for an input-output pair.
  • Figure 5: Top-5 and bottom-5 most diverse style transition pairs in the train and evaluation sets. Self-BLEU is normalized for each style transition pair to make a fair comparison. Style transition pairs with fewer than 5 samples are removed. The lower the self-BLEU score, the more lexically diverse the responses given different dialogue contexts and inputs.
  • ...and 3 more figures