Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset

Rui Liu, Pu Gao, Jiatian Xi, Berrak Sisman, Carlos Busso, Haizhou Li

TL;DR

The paper tackles emotional inconsistency in text-based speech editing by introducing EmoCorrector, a retrieval-augmented post-correction framework that uses cross-modal emotion retrieval and speaker-emotion disentanglement to align the emotion of the synthesized speech with that of the edited text while preserving speaker identity. It also introduces the Emotion Correction Dataset for TSE (ECD-TSE), which pairs emotion-rich text variations with corresponding emotional speech for evaluation. The approach combines EmoCLAP-based cross-modal emotion pretraining, adversarial disentanglement, and a three-stage post-correction pipeline that retrieves emotion references, conditions synthesis on joint emotion and speaker embeddings, and delivers emotionally consistent edited speech. Experimental results on ECD-TSE show substantial improvements in emotional alignment (TSE-MOS, TSE-Acc, ECS) with preserved speech quality, highlighting practical implications for emotionally aware TSE in content creation and media applications.

Abstract

Text-based speech editing (TSE) modifies speech using only text, eliminating re-recording. However, existing TSE methods mainly focus on the content accuracy and acoustic consistency of synthetic speech segments, and often overlook the emotional shifts or inconsistencies introduced by text changes. To address this issue, we propose EmoCorrector, a novel post-correction scheme for TSE. EmoCorrector leverages Retrieval-Augmented Generation (RAG) by extracting the edited text's emotional features, retrieving speech samples with matching emotions, and synthesizing speech that aligns with the desired emotion while preserving the speaker's identity and quality. To support the training and evaluation of emotional consistency modeling in TSE, we introduce a new benchmark, the Emotion Correction Dataset for TSE (ECD-TSE). The prominent aspect of ECD-TSE is its inclusion of $\langle$text, speech$\rangle$ paired data featuring diverse text variations and a range of emotional expressions. Subjective and objective experiments and comprehensive analysis on ECD-TSE confirm that EmoCorrector significantly enhances the expression of intended emotion while addressing emotion inconsistency limitations in current TSE methods. Code and audio examples are available at https://github.com/AI-S2-Lab/EmoCorrector.
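
The retrieval step described in the abstract (matching the edited text's emotional features against candidate speech samples) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy three-dimensional embeddings, and the use of cosine similarity as the matching score are all assumptions made for illustration. In the actual system the embeddings would come from a cross-modal emotion encoder, which is not modeled here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def retrieve_emotion_references(
    text_emotion_embedding: Sequence[float],
    speech_bank: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the top_k (clip_id, score) pairs, best match first.

    Hypothetical helper: ranks stored speech emotion embeddings by their
    similarity to the emotion embedding of the edited text.
    """
    scored = [
        (clip_id, cosine_similarity(text_emotion_embedding, embedding))
        for clip_id, embedding in speech_bank.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data (assumed): each vector stands in for an emotion embedding.
speech_bank = {
    "clip_happy": [0.9, 0.1, 0.0],
    "clip_sad": [0.0, 0.2, 0.9],
    "clip_angry": [0.1, 0.9, 0.1],
}
edited_text_embedding = [0.8, 0.2, 0.1]

best = retrieve_emotion_references(edited_text_embedding, speech_bank, top_k=1)
print(best[0][0])
```

In this toy setup the retrieved clip would then serve as the emotion reference for synthesis, with speaker identity supplied separately; that conditioning step is outside the scope of the sketch.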

Paper Structure

This paper contains 14 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Our approach lies in correcting the emotional mismatch or inconsistency issue of traditional TSE methods.
  • Figure 2: The overall workflow of EmoCorrector.