Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition

Dongyuan Li, Ying Zhang, Yusong Wang, Funakoshi Kataro, Manabu Okumura

TL;DR

This work addresses the gap between speech representations pre-trained for ASR and the downstream SER task by introducing AFTER, a framework that combines Task Adaptation Pre-training (TAPT) with Active Learning (AL) to reduce labeling needs and improve generalization on noisy, real-world data. TAPT first adapts the pre-trained model to the target task; a clustering-informed AL initialization then seeds an iterative loop that selects informative and diverse samples for fine-tuning a wav2vec 2.0-based SER model. With only 20% of the data labeled, AFTER achieves substantial performance gains and markedly shorter training time. Experiments on IEMOCAP, SAVEE, and the Merged dataset (a mixture that includes spontaneous and acted speech) show state-of-the-art results, with improvements of up to about 8–9 percentage points in UA/WA and time savings of nearly 80%. The paper also extends AFTER to HuBERT, multiple-annotator, and soft-label settings, and releases datasets and code to support reproducibility, underscoring its practical value for robust SER in heterogeneous, real-world environments.
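The summary reports gains in UA and WA without defining them. The sketch below assumes the usual SER convention, which this page does not state explicitly: WA (weighted accuracy) is the overall fraction of correctly classified utterances, and UA (unweighted accuracy) is the mean of the per-class recalls. The function names and the toy labels are made up for illustration.

```python
def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: fraction of all utterances classified correctly."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls, so every emotion class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        # Predictions for the utterances whose true label is class c.
        preds_for_c = [p for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(1 for p in preds_for_c if p == c) / len(preds_for_c))
    return sum(recalls) / len(recalls)


# Toy, made-up labels: class 0 is frequent, class 1 is rare.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

wa = weighted_accuracy(y_true, y_pred)    # 5 of 6 utterances are correct
ua = unweighted_accuracy(y_true, y_pred)  # recalls are 1.0 and 0.5
```

On imbalanced emotion datasets the two metrics can diverge, as in the toy example, which is presumably why both are reported.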

Abstract

Speech emotion recognition (SER) has garnered increasing attention due to its wide range of applications, including human-machine interaction, virtual assistants, and mental health assistance. However, existing SER methods often overlook the information gap between the pre-training speech recognition task and the downstream SER task, resulting in sub-optimal performance. Moreover, current methods require considerable time for fine-tuning on each specific speech dataset, such as IEMOCAP, which limits their effectiveness in real-world scenarios with large-scale noisy data. To address these issues, we propose an active learning (AL)-based fine-tuning framework for SER, called AFTER, that leverages task adaptation pre-training (TAPT) and AL methods to enhance performance and efficiency. Specifically, we first use TAPT to minimize the information gap between the pre-training speech recognition task and the downstream SER task. Then, AL methods are employed to iteratively select a subset of the most informative and diverse samples for fine-tuning, thereby reducing time consumption. Experiments demonstrate that AFTER, using only 20% of the samples, improves accuracy by 8.45% and reduces time consumption by 79%. Additional extensions of AFTER and ablation studies further confirm its effectiveness and applicability to various real-world scenarios. Our source code is available on GitHub for reproducibility: https://github.com/Clearloveyuan/AFTER.
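The iterative selection step described in the abstract can be illustrated with a minimal sketch. It uses plain entropy sampling as a stand-in acquisition rule; the abstract does not specify AFTER's actual selection criterion, so this is not the authors' method. The callbacks `predict_proba` and `fine_tune`, the parameter names, and the toy two-class probabilities are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_probs, k):
    """Return the k pool ids whose predictions have the highest entropy."""
    ranked = sorted(pool_probs, key=lambda i: entropy(pool_probs[i]), reverse=True)
    return ranked[:k]


def active_learning_loop(pool_ids, predict_proba, fine_tune,
                         budget_ratio=0.2, per_round=1):
    """Label a fixed share of the pool, a few samples per round."""
    labeled, unlabeled = [], list(pool_ids)
    budget = round(budget_ratio * len(pool_ids))
    while len(labeled) < budget and unlabeled:
        # Score every remaining sample with the current model.
        probs = {i: predict_proba(i) for i in unlabeled}
        picked = select_most_uncertain(probs, min(per_round, budget - len(labeled)))
        labeled.extend(picked)  # in practice: send these to annotators
        unlabeled = [i for i in unlabeled if i not in picked]
        fine_tune(labeled)      # in practice: update the SER model
    return labeled


# Toy stand-ins: a fixed two-class "model" that is least sure about low ids.
def toy_predict_proba(i):
    p = 0.5 + 0.05 * i
    return [p, 1.0 - p]


rounds = []
labeled = active_learning_loop(range(10), toy_predict_proba, rounds.append)
```

With a 20% budget on a pool of ten samples, the loop stops after two samples are labeled. Entropy sampling alone targets informativeness only; a diversity term would be needed to match the "informative and diverse" criterion the abstract mentions.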

Paper Structure

This paper contains 28 sections, 14 equations, 6 figures, 10 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Model overview. First, we pre-train an off-the-shelf wav2vec 2.0 in the TAPT manner. Then, we adopt an AL method to select unlabeled samples for iterative annotation. These labeled samples are used to fine-tune the wav2vec 2.0 model for SER.
  • Figure 2: Ratio of labeled samples vs. Unweighted Accuracy.
  • Figure 3: Comparison of various initialization methods for AL, with Entropy employed as the active learning strategy. Initialization involves selecting 1% of the samples.
  • Figure 4: t-SNE visualization of the samples selected by After and by random sampling. Selected samples are shown in red on the IEMOCAP and Merged datasets.
  • Figure 5: (A) Time Consumption Comparison and (B) Relationship between ratio of labeled samples and time consumption.
  • ...and 1 more figure
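Figure 3 compares initialization methods that pick the first 1% of samples before any model predictions are available. One common clustering-based way to do this is sketched below: run k-means on feature vectors and label the sample nearest each centroid. This is an assumed illustration, not the initialization used in the paper; the helper names, the farthest-point seeding, and the toy 2-D points (standing in for pooled speech embeddings) are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from its nearest existing seed.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
    return centroids


def cluster_init(points, k):
    """For each cluster, pick the index of the point closest to its centroid."""
    centroids = kmeans(points, k)
    picks = {
        min(range(len(points)), key=lambda i: sq_dist(points[i], c))
        for c in centroids
    }
    return sorted(picks)


# Toy 2-D "embeddings" forming two well-separated groups.
points = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.1),
          (5.0, 5.0), (5.2, 5.0), (5.1, 5.1)]
initial_batch = cluster_init(points, k=2)
```

Picking one representative per cluster spreads the initial labels across the feature space, which is the intuition behind preferring a clustering-based start over random sampling.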