EmotionRankCLAP: Bridging Natural Language Speaking Styles and Ordinal Speech Emotion via Rank-N-Contrast
Shreeram Suresh Chandra, Lucas Goncalves, Junchen Lu, Carlos Busso, Berrak Sisman
TL;DR
EmotionRankCLAP addresses a gap in emotion-aware cross-modal learning by treating emotion as an ordinal, continuous construct and aligning speech audio with natural-language speaking-style prompts. It introduces a supervised Rank-N-Contrast objective that ranks cross-modal pairs by their positions in the valence-arousal space, and it uses an LLM to generate dimension-guided captions that bridge speech and text. On MSP-Podcast, it achieves better cross-modal alignment and stronger ordinal consistency for both valence and arousal than baselines, which demonstrates the benefit of ordinal supervision and textual descriptions. The approach improves fine-grained, cross-modal emotion understanding, with practical implications for emotion-aware retrieval and synthesis.
Abstract
Current emotion-based contrastive language-audio pretraining (CLAP) methods typically learn by naïvely aligning audio samples with their corresponding text prompts. This approach fails to capture the ordinal nature of emotions, which hinders inter-emotion understanding and often leaves a wide modality gap between the audio and text embeddings due to insufficient alignment. To address these drawbacks, we introduce EmotionRankCLAP, a supervised contrastive learning approach that uses dimensional attributes of emotional speech and natural language prompts to jointly capture fine-grained emotion variations and improve cross-modal alignment. Our approach uses a Rank-N-Contrast objective to learn ordered relationships by contrasting samples based on their rankings in the valence-arousal space. EmotionRankCLAP outperforms existing emotion-CLAP methods in modeling emotion ordinality across modalities, as measured by a cross-modal retrieval task.
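To make the ranking idea concrete, the following is a minimal sketch of a cross-modal Rank-N-Contrast style loss in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cosine`, `cross_modal_rnc_loss`), the use of cosine similarity, the Euclidean distance in valence-arousal space, the temperature of 0.1, and the toy data are all hypothetical choices. For each audio anchor and each text candidate, the candidate is contrasted only against texts whose valence-arousal labels are at least as far from the anchor's label as the candidate's own.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cross_modal_rnc_loss(audio_emb, text_emb, va_labels, temperature=0.1):
    """Rank-N-Contrast style loss with audio anchors and text candidates.

    audio_emb, text_emb: lists of embedding vectors, paired by index.
    va_labels: list of (valence, arousal) tuples, one per pair.
    Assumption: label distance is Euclidean distance in the valence-arousal plane.
    """
    n = len(audio_emb)
    total = 0.0
    for i in range(n):
        # Scaled similarity of audio anchor i to every text candidate k.
        sims = [cosine(audio_emb[i], text_emb[k]) / temperature for k in range(n)]
        # Label-space distance of anchor i to every candidate k.
        dists = [math.dist(va_labels[i], va_labels[k]) for k in range(n)]
        for j in range(n):
            # Denominator set: candidates ranked at least as far as j from the anchor.
            denom = sum(math.exp(sims[k]) for k in range(n) if dists[k] >= dists[j])
            total += -(sims[j] - math.log(denom))
    return total / (n * n)


# Toy data (hypothetical): three samples ordered along the valence-arousal diagonal.
va_labels = [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)]
half = math.sqrt(0.5)
audio_emb = [(1.0, 0.0), (half, half), (0.0, 1.0)]

# Text embeddings that follow the same ordering as the labels.
text_aligned = [(1.0, 0.0), (half, half), (0.0, 1.0)]
# Text embeddings whose ordering disagrees with the labels.
text_shuffled = [(0.0, 1.0), (1.0, 0.0), (half, half)]

loss_aligned = cross_modal_rnc_loss(audio_emb, text_aligned, va_labels)
loss_shuffled = cross_modal_rnc_loss(audio_emb, text_shuffled, va_labels)
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is lower when the text embeddings preserve the label ordering than when they are shuffled, which is the behavior an ordinal objective is meant to reward. A practical version would operate on batched tensors from the audio and text encoders rather than Python lists.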
