Audio-to-Audio Emotion Conversion With Pitch And Duration Style Transfer

Soumya Dutta, Avni Jain, Sriram Ganapathy

TL;DR

A2A-ZEST presents a fully self-supervised, zero-shot framework for audio-to-audio emotion style transfer that preserves content and speaker identity while transferring emotion from a reference utterance. The method decomposes speech into discrete content tokens, speaker embeddings, and emotion embeddings, then uses a pitch contour predictor and a duration model to realize emotion-conditioned modifications, with synthesis performed by BigVGAN. Across extensive objective and subjective evaluations, A2A-ZEST outperforms prior baselines in emotion transfer fidelity and demonstrates viable data augmentation potential for speech emotion recognition, albeit with limitations in generalization to unseen speakers due to training data scale. The work highlights the effectiveness of disentangled representations and zero-shot transfer for realistic, label-free emotional speech synthesis and provides a pathway toward scalable, nonparallel A2A emotion transfer applications.

Abstract

Given a pair of source and reference speech recordings, audio-to-audio (A2A) style transfer generates output speech that mimics the style characteristics of the reference while preserving the content and speaker attributes of the source. In this paper, we propose a novel framework, termed A2A Zero-shot Emotion Style Transfer (A2A-ZEST), that transfers the emotional attributes of the reference to the source while retaining the source's speaker identity and speech content. The A2A-ZEST framework consists of an analysis-synthesis pipeline, where the analysis module decomposes speech into semantic tokens, speaker representations, and emotion embeddings. Using these representations, a pitch contour estimator and a duration predictor are learned. Further, a synthesis module is designed to generate speech based on the input representations and the derived factors. This entire analysis-synthesis paradigm is trained purely in a self-supervised manner with an auto-encoding loss. For A2A emotion style transfer, the emotion embedding extracted from the reference speech, along with the rest of the representations from the source speech, is used in the synthesis module to generate the style-transferred speech. In our experiments, we evaluate the converted speech on content/speaker preservation (w.r.t. the source) as well as on the effectiveness of the emotion style transfer (w.r.t. the reference). The proposed A2A-ZEST is shown to improve over prior works on these evaluations, thereby enabling style transfer without any parallel training data. We also illustrate the application of the proposed work to data augmentation in emotion recognition tasks.
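
The factor swap used at style-transfer time can be sketched in a few lines. This is only an illustration of the idea stated in the abstract, not the authors' code: the `Factors` container, the `style_transfer_inputs` function, and the toy values are all hypothetical, and real embeddings would come from the trained analysis module.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Factors:
    """Hypothetical container for the analysis outputs of one utterance."""
    tokens: List[int]      # discrete content (semantic) tokens
    speaker: List[float]   # speaker representation
    emotion: List[float]   # emotion embedding


def style_transfer_inputs(source: Factors, reference: Factors) -> Factors:
    """Keep content and speaker from the source; take emotion from the reference."""
    return Factors(
        tokens=source.tokens,
        speaker=source.speaker,
        emotion=reference.emotion,
    )


# Toy values standing in for real analysis outputs (assumed, not from the paper).
source = Factors(tokens=[5, 5, 9, 2], speaker=[0.1, 0.2], emotion=[0.0, 1.0])
reference = Factors(tokens=[7, 7, 7], speaker=[0.9, 0.8], emotion=[1.0, 0.0])

mixed = style_transfer_inputs(source, reference)
```

In this sketch, `mixed` is what would be handed to the synthesis module: nothing from the reference other than its emotion embedding is carried over.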

Paper Structure

This paper contains 32 sections, 7 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Overview of (a) A2A-ZEST training and (b) style transfer paradigm. Emotion style factors are colored differently from the rest. During style transfer, the source speech tokens are passed through the content and speaker encoder, while the duration predictor, F0 reconstruction module and emotion classifier modules receive their input from the reference speech.
  • Figure 2: The model for extracting the speaker embedding. GRL stands for Gradient Reversal Layer. The blue block is kept frozen during training.
  • Figure 3: Pitch contour reconstruction module - The speaker embedding ($\mathbf{s}$) is added to the frame-level emotion embeddings ($\mathbf{E}$) to form the key-value sequence, while the source speech token embeddings ($\mathbf{C}$) form the query sequence. The frame-level outputs from the cross-attention block are passed through a position-wise feedforward network using 1D-CNNs to reconstruct the pitch contour ($\mathbf{\hat{f}}$). The cross-attention block architecture is also expanded for reference.
  • Figure 4: The different factors that are derived from the speech in the analysis phase. The emotion classifier is trained with a speaker adversarial loss. The frame-level embeddings ($\mathbf{E}$), the speaker embedding ($\mathbf{s}$) and speech tokens ($\mathbf{t}$) are used to reconstruct the pitch contour ($\mathbf{\hat{f}}$). Further, the utterance-level emotion embedding ($\mathbf{\bar{e}}$) is used along with the de-duplicated tokens $\mathbf{t}^{'} = \{t_1,...,t_{T^{'}}\}$ to predict the duration of each of the tokens ($\mathbf{\hat{d}}$). All the blue blocks are kept frozen while the yellow blocks are trained. Grey blocks do not contain any learnable parameters.
  • Figure 5: Emotional style transfer - The frame-level ($\mathbf{E}^{ref}$) and utterance-level ($\mathbf{\bar{e}}^{ref}$) embeddings are extracted from the reference speech. The duration prediction is performed using the source tokens $\mathbf{t}^{'}$, the speaker vector $\mathbf{s}$ and the emotion embedding $\mathbf{\bar{e}}^{ref}$. The predicted durations $\mathbf{\hat{d}}^{conv}$ are used to generate the duplicated token sequence $\mathbf{t}^{conv}$. With this token sequence, $\mathbf{E}^{ref}$ and the speaker embedding $\mathbf{s}$, the $F_0$ contour $\mathbf{\hat{f}}^{conv}$ is predicted. Finally, the token sequence, the speaker and emotion embeddings, and the predicted $F_0$ contour are passed to the BigVGAN model to generate the converted speech.
  • ...and 4 more figures
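
The token de-duplication and duration-based re-expansion mentioned in the captions of Figures 4 and 5 can be illustrated with a small sketch. The function names `deduplicate` and `expand` are hypothetical, the durations in `d_conv` are made-up numbers standing in for the output of the duration predictor, and the one-frame minimum per token is an assumption of this sketch rather than something stated in the source.

```python
from itertools import groupby
from typing import List, Tuple


def deduplicate(tokens: List[int]) -> Tuple[List[int], List[int]]:
    """Collapse runs of repeated tokens.

    Returns the de-duplicated tokens t' and the length (in frames) of each run,
    which plays the role of the per-token duration d.
    """
    runs = [(tok, sum(1 for _ in grp)) for tok, grp in groupby(tokens)]
    return [tok for tok, _ in runs], [dur for _, dur in runs]


def expand(tokens: List[int], durations: List[int]) -> List[int]:
    """Repeat each token by its duration to rebuild a frame-level sequence."""
    if len(tokens) != len(durations):
        raise ValueError("tokens and durations must have the same length")
    out: List[int] = []
    for tok, dur in zip(tokens, durations):
        # Assumption of this sketch: every token occupies at least one frame.
        out.extend([tok] * max(1, dur))
    return out


t = [4, 4, 4, 9, 2, 2]            # frame-level source tokens (toy example)
t_prime, d = deduplicate(t)       # t' and the original durations

# Made-up durations standing in for the emotion-conditioned predictor output.
d_conv = [2, 2, 3]
t_conv = expand(t_prime, d_conv)  # duplicated token sequence for synthesis
```

Expanding `t_prime` with its original durations `d` recovers `t` exactly, which mirrors the auto-encoding setting; replacing `d` with different durations changes the speaking rate while leaving the token content untouched.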