RA-CLAP: Relation-Augmented Emotional Speaking Style Contrastive Language-Audio Pretraining For Speech Retrieval

Haoqin Sun; Jingguang Tian; Jiaming Zhou; Hui Wang; Jiabei He; Shiwan Zhao; Xiangyu Kong; Desheng Hu; Xinkang Xu; Xinhui Hu; Yong Qin

TL;DR

The paper tackles emotional speaking style retrieval (ESSR) by introducing ESS-CLAP, a cross-modal contrastive pretraining framework that aligns speech with natural language descriptions. It then extends this approach with RA-CLAP, a two-stage method that uses self-distillation to learn partial, local speech-text matches rather than assuming a strictly binary caption-audio relation. The first stage pretrains a dual encoder with the InfoNCE loss; the second stage distills from a teacher model. RA-CLAP improves generalization across several emotional speaking style description (ESSD) datasets, including PromptSpeech, TextrolSpeech, and SpeechCraft. The results demonstrate the feasibility and benefits of contrastive pretraining for ESSD and lay groundwork for more expressive emotional speaking style captioning and prompt-driven speech synthesis.

Abstract

The Contrastive Language-Audio Pretraining (CLAP) model has demonstrated excellent performance in general audio description-related tasks, such as audio retrieval. However, in the emerging field of emotional speaking style description (ESSD), cross-modal contrastive pretraining remains largely unexplored. In this paper, we propose a novel speech retrieval task called emotional speaking style retrieval (ESSR), together with ESS-CLAP, an emotional speaking style CLAP model tailored for learning the relationship between speech and natural language descriptions. We further propose relation-augmented CLAP (RA-CLAP) to address a limitation of traditional methods, which assume a strict binary relationship between caption and audio. The model leverages self-distillation to learn the potential local matching relationships between speech and descriptions, thereby enhancing its generalization ability. The experimental results validate the effectiveness of RA-CLAP and provide a valuable reference for ESSD.
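
The contrast between a strict binary caption-audio relationship and softer, relation-aware targets can be illustrated with a small sketch. This is not the paper's implementation: the function names, the mixing weight `alpha`, the temperature value, and the way teacher similarities are blended with the one-hot targets are all assumptions made for illustration, and the embeddings stand in for the outputs of hypothetical speech and text encoders.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(x, axis):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def softmax(x, axis):
    return np.exp(log_softmax(x, axis))

def similarity_logits(speech_emb, text_emb, temperature):
    # (B, B) matrix of temperature-scaled cosine similarities.
    return l2_normalize(speech_emb) @ l2_normalize(text_emb).T / temperature

def row_cross_entropy(logits, targets):
    # Mean cross-entropy between each row of targets and softmax(logits).
    return -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()

def info_nce(speech_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE: only the paired caption counts as a match (binary relation).
    logits = similarity_logits(speech_emb, text_emb, temperature)
    hard = np.eye(logits.shape[0])
    return 0.5 * (row_cross_entropy(logits, hard) + row_cross_entropy(logits.T, hard))

def relation_augmented_loss(student_speech, student_text,
                            teacher_speech, teacher_text,
                            alpha=0.5, temperature=0.07):
    # Assumed distillation scheme: blend one-hot targets with the teacher's
    # similarity distribution so partially matching pairs receive some credit.
    logits = similarity_logits(student_speech, student_text, temperature)
    teacher = similarity_logits(teacher_speech, teacher_text, temperature)
    hard = np.eye(logits.shape[0])
    soft_s2t = (1 - alpha) * hard + alpha * softmax(teacher, axis=1)
    soft_t2s = (1 - alpha) * hard + alpha * softmax(teacher.T, axis=1)
    return 0.5 * (row_cross_entropy(logits, soft_s2t)
                  + row_cross_entropy(logits.T, soft_t2s))
```

With `alpha=0` the second function reduces to plain InfoNCE, which makes the role of the teacher-derived soft targets easy to isolate in this sketch.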
