Emotion-Aware Contrastive Adaptation Network for Source-Free Cross-Corpus Speech Emotion Recognition

Yan Zhao, Jincen Wang, Cheng Lu, Sunan Li, Björn Schuller, Yuan Zong, Wenming Zheng

TL;DR

The paper tackles source-free cross-corpus speech emotion recognition, where a pre-trained source model must adapt to a target corpus without access to the source data. It introduces the Emotion-Aware Contrastive Adaptation Network (ECAN), which combines nearest neighbor contrastive learning with a memory bank for local semantic consistency and supervised contrastive learning with a score bank for robust global class separation, augmented by a diversity regularizer. The total objective, $L = L_{div} + \lambda L_{ncl} + \beta L_{scl}$, enforces balanced predictions while promoting intra-class compactness and inter-class separation, yielding improved cross-corpus transfer. Experiments on four corpora (EMOVO, EmoDB, eNTERFACE, CASIA) across 12 source-target tasks show ECAN achieving a mean unweighted average recall of 37.19%, outperforming several state-of-the-art baselines in the source-free setting and demonstrating practical potential for privacy-preserving SER transfer.
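The evaluation setup can be made concrete with a small sketch. Unweighted average recall (UAR) is the mean of per-class recalls, so each emotion counts equally regardless of how many utterances it has. The sketch also assumes that the 12 tasks are all ordered (source, target) pairs of the four corpora; the summary only states the count, which is consistent with that assumption. The function name `unweighted_average_recall` and the toy labels are illustrative and not taken from the paper.

```python
from itertools import permutations


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; every class counts equally regardless of size."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        # Indices of all samples whose true label is class c.
        idx = [i for i, t in enumerate(y_true) if t == c]
        hits = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Assumption: each task is an ordered (source, target) pair of distinct corpora.
corpora = ["EMOVO", "EmoDB", "eNTERFACE", "CASIA"]
tasks = list(permutations(corpora, 2))

# Toy labels (hypothetical): a model that always predicts class 0.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0]
uar = unweighted_average_recall(y_true, y_pred)
```

On the toy labels the always-predict-0 model has recall 1.0 on class 0 and 0.0 on class 1, so its UAR is 0.5 even though its plain accuracy is 4/6. This is why UAR is a common choice for emotion corpora with imbalanced classes.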

Abstract

Cross-corpus speech emotion recognition (SER) aims to transfer emotional knowledge from a labeled source corpus to an unlabeled target corpus. However, prior methods require access to the source data during adaptation, which is often infeasible in real-life scenarios because of data privacy concerns. This paper tackles a more practical task, namely source-free cross-corpus SER, where a pre-trained source model is adapted to the target domain without access to the source data. To address this problem, we propose a novel method called the emotion-aware contrastive adaptation network (ECAN). The core idea is to capture local neighborhood information between samples while also performing global class-level adaptation. Specifically, we propose nearest-neighbor contrastive learning to promote local emotion consistency among the features of highly similar samples. However, relying solely on nearest neighbors may leave the boundaries between clusters ambiguous. We therefore incorporate supervised contrastive learning to encourage greater separation between clusters representing different emotions, thereby improving class-level adaptation. Extensive experiments indicate that the proposed ECAN significantly outperforms state-of-the-art methods under the source-free cross-corpus SER setting on several speech emotion corpora.
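The interplay of the three terms in $L = L_{div} + \lambda L_{ncl} + \beta L_{scl}$ can be illustrated with a minimal pure-Python sketch. The exact loss definitions are not given in this summary, so every form below is an assumption: the diversity term is taken as the negative entropy of the batch-mean prediction, both contrastive terms use an InfoNCE-style loss over cosine similarities, and the current batch stands in for the memory bank and score bank. All function names, the temperature `tau`, and the neighbor count `k` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


def diversity_loss(probs):
    # Assumed form: negative entropy of the mean prediction.
    # Minimizing it pushes the average prediction towards a balanced one.
    n, c = len(probs), len(probs[0])
    mean = [sum(p[k] for p in probs) / n for k in range(c)]
    return sum(m * math.log(m + 1e-8) for m in mean)


def contrastive_loss(feats, positives, tau):
    # InfoNCE-style loss; positives[i] lists the indices pulled towards sample i.
    total, count = 0.0, 0
    for i, f in enumerate(feats):
        if not positives[i]:
            continue
        logits = {j: dot(f, g) / tau for j, g in enumerate(feats) if j != i}
        log_denom = math.log(sum(math.exp(s) for s in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives[i]) / len(positives[i])
        count += 1
    return total / max(count, 1)


def nearest_neighbors(feats, k):
    # Local positives: the k most similar other samples (batch used as the bank).
    out = []
    for i, f in enumerate(feats):
        sims = sorted(((dot(f, g), j) for j, g in enumerate(feats) if j != i),
                      reverse=True)
        out.append([j for _, j in sims[:k]])
    return out


def same_pseudo_label(probs):
    # Global positives: samples sharing the same argmax pseudo-label.
    labels = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return [[j for j, l in enumerate(labels) if l == labels[i] and j != i]
            for i in range(len(labels))]


def total_objective(feats, probs, lam=1.0, beta=1.0, k=1, tau=0.1):
    feats = [normalize(f) for f in feats]
    l_div = diversity_loss(probs)
    l_ncl = contrastive_loss(feats, nearest_neighbors(feats, k), tau)
    l_scl = contrastive_loss(feats, same_pseudo_label(probs), tau)
    return l_div + lam * l_ncl + beta * l_scl


# Toy data (hypothetical): two tight clusters with confident predictions.
probs = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
aligned = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
# Same features, but pseudo-labels now disagree with the cluster structure.
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
loss_aligned = total_objective(aligned, probs)
loss_misaligned = total_objective(misaligned, probs)
```

In this toy setup the diversity and nearest-neighbor terms are identical for both feature arrangements, so the gap between `loss_aligned` and `loss_misaligned` comes entirely from the class-level term. That mirrors the motivation stated in the abstract: neighborhood consistency alone does not guarantee well-separated emotion clusters.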

Paper Structure

This paper contains 13 sections, 5 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview Structure of the Proposed ECAN in Dealing with Source-free Cross-Corpus SER.
  • Figure 2: Comparison with Cross-Corpus SER Methods.
  • Figure 3: The t-SNE visualization on the task of C$\rightarrow$B.