
Re-ENACT: Reinforcement Learning for Emotional Speech Generation using Actor-Critic Strategy

Ravi Shankar, Archana Venkataraman

TL;DR

This work introduces an actor-critic reinforcement learning framework that modifies emotional prosody in three steps: it locates contiguous emotion-relevant segments with a Markov Bernoulli mask, predicts soft emotion scores, and applies prosodic edits (pitch, intensity, rhythm) via WSOLA. A neural salience predictor and a discretized action space let the prosody modifier be optimized without backpropagating through the non-differentiable WSOLA operation, which allows unsupervised or weakly supervised emotion conversion. Empirical results on VESUS and CREMAD show reliable salience prediction and emotion-conversion performance competitive with both supervised and unsupervised baselines, along with positive subjective feedback and a manageable impact on intelligibility. The approach offers a unified, data-efficient path to emotional speech generation, with potential use in TTS conditioning and data augmentation, subject to caveats about intelligibility and artifact control.

Abstract

In this paper, we propose the first method to modify the prosodic features of a given speech signal using an actor-critic reinforcement learning strategy. Our approach uses a Bayesian framework to identify contiguous segments of importance that link parts of a given utterance to the human perception of emotion. We train a neural network to produce the variational posterior of a collection of Bernoulli random variables; our model places a Markov prior on these variables to ensure continuity. A sample from this distribution is used as a mask for downstream emotion prediction. We further train the neural network to predict a soft assignment over emotion categories as the target variable. In the next step, we modify the prosodic features (pitch, intensity, and rhythm) of the masked segment to increase the score of the target emotion. We employ actor-critic reinforcement learning to train the prosody modifier by discretizing the space of modifications. This also provides a simple solution to the problem of computing gradients through the WSOLA operation used for rhythm manipulation. Our experiments demonstrate that this framework changes the perceived emotion of a given speech utterance to the target emotion. Further, we show that our unified technique is on par with state-of-the-art emotion conversion models from supervised and unsupervised domains that require pairwise training.
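
The role of the Markov prior in keeping the mask contiguous can be illustrated with a minimal sketch. It only samples a binary mask from a two-state first-order Markov chain; it is not the authors' model, which learns a variational posterior with a neural network. The function names and the `p_stay` and `p_init` parameters are hypothetical choices made for illustration.

```python
import numpy as np


def sample_markov_mask(n_frames, p_stay=0.9, p_init=0.5, seed=0):
    """Sample a 0/1 mask from a two-state first-order Markov chain.

    p_stay (hypothetical parameter): probability of keeping the previous
    state. Values close to 1 give long contiguous runs of 0s and 1s.
    p_init (hypothetical parameter): probability that the first frame is 1.
    """
    rng = np.random.default_rng(seed)
    mask = np.empty(n_frames, dtype=np.int64)
    mask[0] = int(rng.random() < p_init)
    for t in range(1, n_frames):
        stay = rng.random() < p_stay
        # Either copy the previous state or flip it.
        mask[t] = mask[t - 1] if stay else 1 - mask[t - 1]
    return mask


def count_switches(mask):
    """Number of 0->1 or 1->0 transitions; fewer switches means longer segments."""
    return int(np.count_nonzero(np.diff(mask)))


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.99):
        m = sample_markov_mask(200, p_stay=p)
        print(f"p_stay={p}: {count_switches(m)} switches")
```

With `p_stay = 0.5` each frame is effectively an independent coin flip, so the mask is fragmented; raising `p_stay` toward 1 yields a few long segments, which is the kind of continuity the abstract attributes to the Markov prior.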

Paper Structure (16 sections, 8 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: Overlap-add operation used to stretch the input signal.
  • Figure 2: RL strategy: reinforcement learning framework for predicting the modification factor. The grey panel summarizes the state of the observer, the red panel constitutes the action space, and the green panel represents the environment with WSOLA.
  • Figure 3: (a) Example of salience score obtained from AMT and (b) Transition diagram for masking random variables.
  • Figure 4: Neural network used for prediction of human perception of emotional saliency. The architecture has three components: (a) feature extraction from raw waveform, (b) mask generator using Markov masking and (c) salience prediction.
  • Figure 5: RL architecture used for factor prediction.
  • ...and 5 more figures
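
The overlap-add stretching named in the Figure 1 caption can be sketched as below. This is a plain overlap-add (OLA) time stretch, written as an assumption-laden illustration: WSOLA additionally searches for the best-aligned frame by waveform similarity, a step omitted here. The function name `ola_stretch` and the `frame_len` and `hop_out` defaults are hypothetical.

```python
import numpy as np


def ola_stretch(x, rate, frame_len=512, hop_out=128):
    """Time-stretch x by plain overlap-add; rate > 1 lengthens the signal.

    Frames are read from the input every hop_out / rate samples and
    written to the output every hop_out samples, then window-normalized.
    """
    x = np.asarray(x, dtype=np.float64)
    if len(x) < frame_len:
        raise ValueError("input must be at least frame_len samples long")
    hop_in = hop_out / rate  # analysis hop (may be fractional)
    window = np.hanning(frame_len)
    n_frames = int((len(x) - frame_len) / hop_in) + 1
    out = np.zeros((n_frames - 1) * hop_out + frame_len)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        start_in = int(round(i * hop_in))
        start_out = i * hop_out
        frame = x[start_in:start_in + frame_len]
        out[start_out:start_out + frame_len] += frame * window
        norm[start_out:start_out + frame_len] += window
    # Avoid dividing by zero where the window sum vanishes (the two end samples).
    norm[norm < 1e-8] = 1.0
    return out / norm


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 220.0 * t)
    slow = ola_stretch(tone, rate=2.0)
    print(len(tone), len(slow))
```

Because the frame positions are chosen by index arithmetic rather than by a differentiable function of `rate`, gradients do not flow through the stretch factor, which is consistent with the abstract's motivation for treating the modification as a discrete action.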