Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue

Guan-Ting Lin, Prashanth Gurunath Shivakumar, Ankur Gandhe, Chao-Han Huck Yang, Yile Gu, Shalini Ghosh, Andreas Stolcke, Hung-yi Lee, Ivan Bulyko

TL;DR

ParalinGPT addresses a gap in spoken-dialogue modeling by combining text with paralinguistic cues, carried by continuous speech embeddings, in a serialized multitasking framework. Using a DialoGPT backbone and wav2vec 2.0 embeddings, it predicts the current sentiment and the response sentiment and then generates the next response text; it is evaluated on Switchboard-1 with sentiment annotations. The approach improves sentiment accuracy for both the current and response turns and benefits from longer multi-turn context and multimodal input, though response-text BLEU can be affected by sentiment noise. The work indicates that incorporating paralinguistic information and context improves the naturalness and relevance of spoken-dialogue responses, with potential for broader applications and future extensions to additional paralinguistic tasks.

Abstract

Large Language Models (LLMs) have demonstrated superior abilities in tasks such as chatting, reasoning, and question-answering. However, standard LLMs may ignore crucial paralinguistic information, such as sentiment, emotion, and speaking style, which are essential for achieving natural, human-like spoken conversation, especially when such information is conveyed by acoustic cues. We therefore propose Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT), an LLM that utilizes text and speech modalities to better model the linguistic content and paralinguistic attributes of spoken dialogue. The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking multimodal framework. Specifically, our framework serializes tasks in the order of current paralinguistic attribute prediction, response paralinguistic attribute prediction, and response text generation with autoregressive conditioning. We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset. Experimental results indicate the proposed serialized multitasking method outperforms typical sequence classification techniques on current and response sentiment classification. Furthermore, leveraging conversational context and speech embeddings significantly improves both response text generation and sentiment prediction. Our proposed framework achieves relative improvements of 6.7%, 12.0%, and 3.5% in current sentiment accuracy, response sentiment accuracy, and response text BLEU score, respectively.
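
The serialization order described in the abstract (current sentiment, then response sentiment, then response text) can be illustrated with a minimal text-only sketch. This is not the authors' implementation: the `Turn` class, the `build_prompt` and `build_target` helpers, the `<speech>` and `<sep>` markers, and the `max_history` parameter are all hypothetical names introduced here for illustration. In particular, `<speech>` only marks where a continuous speech embedding would be supplied to the model; how the embeddings are actually injected is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical marker strings (assumptions, not the paper's special tokens).
SPEECH_SLOT = "<speech>"  # stand-in for a continuous speech embedding
SEP = "<sep>"             # stand-in for a turn separator


@dataclass
class Turn:
    text: str
    sentiment: Optional[str] = None  # illustrative labels, e.g. "positive"


def build_prompt(history: List[Turn], current: Turn, max_history: int = 3) -> str:
    """Serialize the context history and the current turn into one prompt.

    Each history turn contributes its text, a speech slot, and its sentiment
    label; the current turn contributes only text and a speech slot, because
    its sentiment is a prediction target.
    """
    # Guard against history[-0:], which would keep the whole list.
    kept = history[-max_history:] if max_history > 0 else []
    parts = [f"{t.text} {SPEECH_SLOT} <{t.sentiment}>" for t in kept]
    parts.append(f"{current.text} {SPEECH_SLOT}")
    return f" {SEP} ".join(parts)


def build_target(current_sentiment: str, response_sentiment: str,
                 response_text: str) -> str:
    """Order the targets so each one is conditioned on those before it."""
    return f"<{current_sentiment}> <{response_sentiment}> {response_text}"


history = [
    Turn("how was the trip", "neutral"),
    Turn("it was wonderful", "positive"),
]
current = Turn("glad to hear that")

prompt = build_prompt(history, current, max_history=1)
target = build_target("positive", "positive", "we should go together next time")
print(prompt)
print(target)
```

Because an autoregressive model emits the target left to right, placing the two sentiment labels before the response text lets the generated response be conditioned on both predicted labels. The `max_history` argument is only a simple way to vary how many context turns are kept.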

Paper Structure

This paper contains 18 sections, 1 equation, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Speech dialogue scenario. <> denotes paralinguistic sentiment associated with an utterance. "Context History" refers to conversational turns before the "Current Turn", and "Response Turn" is what follows the current turn.
  • Figure 2: ParalinGPT and serialized multitasking: The history context (including text, speech, and sentiment labels), current text, and current speech encoding are the input prompt for the ParalinGPT LLM. The prediction target is the current sentiment label, the response sentiment label, and the response text via autoregression.
  • Figure 3: Effect of context length on response sentiment classification and text generation with the proposed method.