
JELLY: Joint Emotion Recognition and Context Reasoning with LLMs for Conversational Speech Synthesis

Jun-Hyeok Cha, Seung-Bin Kim, Hyung-Seok Oh, Seong-Whan Lee

TL;DR

JELLY addresses the problem of generating emotionally appropriate speech in conversation by jointly recognizing emotion and reasoning about context with a large language model (LLM). It introduces an Emotion-aware Q-former encoder (EQ-former) that perceives emotion in speech and aligns it with text, and it fine-tunes the LLM for emotional context reasoning with multiple partial LoRA (PLoRA) adapters. Training proceeds in three stages: emotion-text alignment, emotional reasoning learned from textual data, and emotion-aware synthesis. Together these let the model infer emotional state from speech alone. Experiments across diverse datasets show that JELLY outperforms baselines on emotional context reasoning and speech synthesis metrics, and that the speech-only variant remains robust without transcripts or emotion labels.

Abstract

Recently, there has been a growing demand for conversational speech synthesis (CSS) that generates more natural speech by considering the conversational context. To address this, we introduce JELLY, a novel CSS framework that integrates emotion recognition and context reasoning for generating appropriate speech in conversation by fine-tuning a large language model (LLM) with multiple partial LoRA modules. We propose an Emotion-aware Q-former encoder, which enables the LLM to perceive emotions in speech. The encoder is trained to align speech emotions with text, utilizing datasets of emotional speech. The entire model is then fine-tuned with conversational speech data to infer emotional context for generating emotionally appropriate speech in conversation. Our experimental results demonstrate that JELLY excels in emotional context modeling, synthesizing speech that naturally aligns with conversation, while mitigating the scarcity of emotional conversational speech datasets.

Paper Structure (17 sections, 2 figures, 2 tables)

This paper contains 17 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Examples of different emotional contexts that can arise from the same conversation content.
  • Figure 2: Overview of the JELLY framework.