Domain Adapting Deep Reinforcement Learning for Real-world Speech Emotion Recognition

Thejan Rajapakshe, Rajib Rana, Sara Khalifa, Bjorn W. Schuller

TL;DR

The paper tackles the challenge of adapting speech emotion recognition (SER) models to real-world, dynamic domains. It introduces RL-DA, a reinforcement learning-based domain adaptation framework that pre-trains a SER model on a source corpus and then uses a Deep Q-Network to adapt it to a target domain through continual feedback during inference. Across cross-corpus and cross-language tasks, RL-DA improves substantially over supervised domain adaptation baselines, including in simulated live-data and noisy conditions, with gains of about 11% (cross-corpus) and 14% (cross-language) in the live data feed setting. The work also provides a public demonstration platform and discusses practical considerations for online learning and scenario-specific testing, highlighting real-world applicability and future directions for robust emotion-enabled systems.

Abstract

Computers can understand and then engage with people in an emotionally intelligent way thanks to speech emotion recognition (SER). However, the performance of SER in cross-corpus and real-world live data feed scenarios can be significantly improved. The inability to adapt an existing model to a new domain is one of the shortcomings of SER methods. To address this challenge, researchers have developed domain adaptation techniques that transfer knowledge learnt by a model across domains. Although existing domain adaptation techniques have improved performance across domains, they can be improved further to handle a real-world live data feed situation in which a model self-tunes while deployed. In this paper, we present a deep reinforcement learning-based strategy (RL-DA) for adapting a pre-trained model to a real-world live data feed setting while interacting with the environment and collecting continual feedback. RL-DA is evaluated on SER tasks, including cross-corpus and cross-language domain adaptation schemas. Evaluation results show that in a live data feed setting, RL-DA outperforms a baseline strategy by 11% and 14% in cross-corpus and cross-language scenarios, respectively.
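
To make the idea of adapting through continual feedback more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy simulated environment in which each "utterance" is a noisy feature vector, the action is the predicted emotion, and the feedback is a reward of +1 for a correct prediction and -1 otherwise. A linear Q-function stands in for the Deep Q-Network, and every name here (`LinearQAgent`, `simulated_utterance`, `run_adaptation`, the `EMOTIONS` label set, all hyperparameters) is hypothetical.

```python
import random
from collections import deque

EMOTIONS = ["angry", "happy", "neutral", "sad"]  # hypothetical label set


class LinearQAgent:
    """Toy stand-in for a DQN: one linear Q-function per action (emotion)."""

    def __init__(self, n_features, n_actions, lr=0.05, gamma=0.0,
                 epsilon=0.1, seed=0):
        self.rng = random.Random(seed)
        self.w = [[0.0] * n_features for _ in range(n_actions)]
        self.lr = lr
        # Assumption: utterances in this toy environment are independent,
        # so future reward is not discounted in (gamma = 0).
        self.gamma = gamma
        self.epsilon = epsilon
        self.memory = deque(maxlen=500)  # replay memory of past transitions

    def q_values(self, state):
        return [sum(wi * xi for wi, xi in zip(w, state)) for w in self.w]

    def act(self, state):
        # Epsilon-greedy policy: explore occasionally, otherwise exploit.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.w))
        q = self.q_values(state)
        return q.index(max(q))

    def remember(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def replay(self, batch_size=16):
        # Sample stored transitions and apply a Q-learning style update.
        k = min(batch_size, len(self.memory))
        for s, a, r, s_next in self.rng.sample(list(self.memory), k):
            target = r + self.gamma * max(self.q_values(s_next))
            td_error = target - self.q_values(s)[a]
            for i, xi in enumerate(s):
                self.w[a][i] += self.lr * td_error * xi


def simulated_utterance(rng, n_classes):
    """Return (features, true_label) for one synthetic utterance."""
    label = rng.randrange(n_classes)
    features = [rng.gauss(1.0 if i == label else 0.0, 0.3)
                for i in range(n_classes)]
    return features, label


def run_adaptation(steps=2000, seed=0):
    """Adapt online from feedback; return accuracy over the last 500 steps."""
    rng = random.Random(seed)
    n = len(EMOTIONS)
    agent = LinearQAgent(n_features=n, n_actions=n, seed=seed + 1)
    state, label = simulated_utterance(rng, n)
    correct_late = 0
    for t in range(steps):
        action = agent.act(state)
        reward = 1.0 if action == label else -1.0  # simulated feedback
        next_state, next_label = simulated_utterance(rng, n)
        agent.remember(state, action, reward, next_state)
        agent.replay()
        if t >= steps - 500:
            correct_late += int(action == label)
        state, label = next_state, next_label
    return correct_late / 500


if __name__ == "__main__":
    print(f"accuracy over last 500 steps: {run_adaptation():.2f}")
```

In this sketch the agent starts from zero weights rather than from a pre-trained base model. A closer analogue of the described setting would initialise the Q-function from a model trained on a source corpus and then run the same feedback loop on target-domain data.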

Paper Structure

This paper contains 32 sections, 5 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Overarching design for incorporating reinforcement learning in domain adaptation to enhance the accuracy of speech emotion recognition. Initially, the Base Model is pre-trained on a source dataset; it is subsequently optimised for the target domain using reinforcement learning, aided by user feedback, to produce a Domain Adapted Model.
  • Figure 2: The architecture of a Reinforcement Learning system featuring a Deep Q Network. The RL Agent comprises several constituent parts: Memory, RL policy, and DQN. Interacting with a simulated environment, the Agent receives emotional cues (actions) input and produces an output in response, which comprises feedback and the subsequent audio utterance (state).
  • Figure 3: Composition of datasets used in Target and Source: (a) the Target dataset is not mixed with Source data and (b) the Target dataset is mixed with Source data. The source dataset is used to pre-train the base model while the target dataset is used for domain adaptation.
  • Figure 4: Comparison of the accuracy of each RL-DA experiment with separate datasets vs mixed datasets.
  • Figure 5: Comparison of the UAR of SL-DA - rw and the proposed RL-DA approach, without (w/o) and with (w/) background noise, for each cross-corpus and cross-language schema, simulating a live data feed scenario.
  • ...and 1 more figures