EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast

Shreeram Suresh Chandra, Lucas Goncalves, Junchen Lu, Carlos Busso, Berrak Sisman

TL;DR

EmotionRankCLAP addresses two weaknesses of emotion-based CLAP models: they ignore the ordinal nature of emotion, and they leave a wide gap between audio and text embeddings. It treats emotion as an ordinal, continuous construct and aligns audio with natural-language speaking-style prompts. It introduces a supervised Rank-N-Contrast objective that ranks cross-modal pairs in the valence-arousal space, and it uses an LLM to generate dimension-guided captions that bridge speech and text. Empirical results on MSP-Podcast show better cross-modal alignment and stronger ordinal consistency for both valence and arousal than the baselines, which indicates that ordinal supervision and textual descriptions both help. The approach improves fine-grained, cross-modal emotion understanding, with practical implications for emotion-aware retrieval and synthesis tasks.

Abstract

Current emotion-based contrastive language-audio pretraining (CLAP) methods typically learn by naïvely aligning audio samples with corresponding text prompts. Consequently, this approach fails to capture the ordinal nature of emotions, hindering inter-emotion understanding and often resulting in a wide modality gap between the audio and text embeddings due to insufficient alignment. To handle these drawbacks, we introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations and improve cross-modal alignment. Our approach utilizes a Rank-N-Contrast objective to learn ordered relationships by contrasting samples based on their rankings in the valence-arousal space. EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, measured via a cross-modal retrieval task.
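
To make the ranking idea in the abstract concrete, here is a minimal NumPy sketch of an audio-to-text Rank-N-Contrast term. It is an illustration under stated assumptions, not the authors' implementation: the function name cross_modal_rnc_loss, the use of Euclidean distance in valence-arousal space as the ranking criterion, cosine similarity, the temperature value, and the choice to let each anchor's own caption act as a positive are all assumptions made for this example.

```python
import numpy as np


def cross_modal_rnc_loss(audio_emb, text_emb, va_labels, temperature=0.1):
    """Audio-to-text Rank-N-Contrast loss (illustrative sketch).

    audio_emb, text_emb: (N, D) arrays; row i of each describes the same sample.
    va_labels: (N, 2) array of (valence, arousal) values.
    """
    # Cosine similarity between every audio anchor and every text candidate.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (a @ t.T) / temperature  # shape (N, N)

    # Assumed label distance: Euclidean distance in valence-arousal space.
    diff = va_labels[:, None, :] - va_labels[None, :, :]
    label_dist = np.linalg.norm(diff, axis=-1)  # shape (N, N)

    n = logits.shape[0]
    total = 0.0
    for i in range(n):      # audio anchor
        for j in range(n):  # text candidate treated as the positive
            # Denominator set: candidates whose labels are at least as far
            # from the anchor as j's label (this always includes j itself).
            in_denominator = label_dist[i] >= label_dist[i, j]
            z = logits[i, in_denominator]
            # Numerically stable log-sum-exp over the denominator set.
            log_denominator = z.max() + np.log(np.exp(z - z.max()).sum())
            total += log_denominator - logits[i, j]
    return total / (n * n)


# Usage with random placeholder data (values are arbitrary).
rng = np.random.default_rng(0)
audio = rng.normal(size=(8, 16))
text = rng.normal(size=(8, 16))
va = rng.uniform(1.0, 7.0, size=(8, 2))
print(cross_modal_rnc_loss(audio, text, va))
```

Under these assumptions, the term for an anchor's own caption has every caption in its denominator, so it reduces to the usual contrastive term for the matched pair. The terms for more distant captions are what add the ordinal structure. A symmetric text-to-audio term could be added in the same way.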

Paper Structure

This paper contains 16 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Illustration of Rank-N-Contrast in a cross-modal setting. The anchor is boxed in blue. (a) A batch of speech-text pairs along with their valence-arousal labels. (b) Positive and negative pair selection via Rank-N-Contrast criteria.
  • Figure 2: Prompt used to generate emotional style descriptions based on valence-arousal values.
  • Figure 3: Cross-Modality Emotion Ordinality Test: This figure shows a three-sample example for valence ordinal consistency, while the actual evaluation uses 14 samples per list, repeated across 100 lists for both valence and arousal.
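
The positive and negative pair selection described for Figure 1(b) can be sketched as follows. This is a hedged illustration only: the helper rnc_negative_sets, the Euclidean distance in valence-arousal space, and the label values are assumptions made for the example and are not taken from the paper.

```python
import numpy as np


def rnc_negative_sets(va_labels, anchor):
    """For one anchor, map each candidate index j to the candidates that act
    as negatives when j is treated as the positive (illustrative sketch)."""
    # Assumed ranking criterion: Euclidean distance in valence-arousal space.
    dist = np.linalg.norm(va_labels - va_labels[anchor], axis=1)
    negatives = {}
    for j in range(len(va_labels)):
        # Negatives: candidates at least as far from the anchor as j,
        # excluding j itself.
        mask = dist >= dist[j]
        mask[j] = False
        negatives[j] = [int(k) for k in np.flatnonzero(mask)]
    return negatives


# Hypothetical batch of (valence, arousal) labels; the values are made up.
labels = np.array([[4.0, 4.0], [4.5, 4.2], [2.0, 6.0], [6.5, 1.5]])
print(rnc_negative_sets(labels, anchor=0))
# prints {0: [1, 2, 3], 1: [2, 3], 2: [3], 3: []}
```

With sample 0 as the anchor, its own caption is contrasted against all other captions, the next-closest caption (sample 1) is contrasted only against the two more distant ones, and the farthest caption has no negatives left. The candidates are therefore ordered by their distance from the anchor rather than split into a single positive and a fixed set of negatives.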