REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion
Ishan D. Biyani, Nirmesh J. Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv R. Shah
TL;DR
This work tackles speaker-identity disentanglement in diffusion-based voice conversion (VC) under limited target data by using full-utterance speech time reversal (STR) as a data-augmentation strategy. The authors fuse conventional speaker embeddings with STR-derived embeddings through a weighted scheme within a diffusion-based VC framework (DDDM-VC) that relies on self-supervised content, pitch, and speaker representations. Perceptual studies show that STR preserves speaker identity despite the loss of intelligibility, and experiments on LibriTTS and VCTK show gains in speaker similarity while maintaining high speech quality, outperforming several state-of-the-art baselines. An ablation study supports weighting the two embeddings equally and shows that cross-attention brings further improvements. The approach offers a data-efficient route to better zero-shot VC and invites exploration of unconventional transforms in speech representation learning.
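The weighted fusion described above can be sketched as a convex combination of the two embeddings. This is only an illustration under stated assumptions: the function name `fuse_speaker_embeddings`, the parameter `alpha`, and the linear form of the combination are hypothetical and are not taken from the paper, which states only that a weighted scheme is used and that equal weighting is supported by the ablation.

```python
import numpy as np

def fuse_speaker_embeddings(e_fwd, e_rev, alpha=0.5):
    """Blend a forward-speech embedding with an STR-derived embedding.

    Assumption: the fusion is a convex combination
        e = alpha * e_fwd + (1 - alpha) * e_rev
    with alpha = 0.5 standing in for the "equal weighting" setting.
    """
    e_fwd = np.asarray(e_fwd, dtype=np.float64)
    e_rev = np.asarray(e_rev, dtype=np.float64)
    if e_fwd.shape != e_rev.shape:
        raise ValueError("embeddings must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Weighted sum of the two speaker representations.
    return alpha * e_fwd + (1.0 - alpha) * e_rev

# Toy 2-dimensional embeddings, used only to show the arithmetic.
fused = fuse_speaker_embeddings([1.0, 2.0], [3.0, 4.0])
```

With `alpha=0.5` the result is the element-wise mean of the two vectors; setting `alpha=1.0` recovers the conventional embedding alone, which gives a natural baseline for an ablation over the weight.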
Abstract
Speech time reversal refers to the process of reversing the entire speech signal in time, causing it to play backward. Such signals are completely unintelligible since the fundamental structures of phonemes and syllables are destroyed. However, they still retain tonal patterns that enable perceptual speaker identification despite the loss of linguistic content. In this paper, we propose leveraging speaker representations learned from time-reversed speech as an augmentation strategy to enhance speaker representation. Notably, speaker and language disentanglement in voice conversion (VC) is essential to accurately preserve a speaker's unique vocal traits while minimizing interference from linguistic content. The effectiveness of the proposed approach is evaluated in the context of state-of-the-art diffusion-based VC models. Experimental results indicate that the proposed approach significantly improves speaker-similarity scores while maintaining high speech quality.
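Full-utterance time reversal amounts to flipping the sample order of the waveform. The sketch below uses a synthetic noise signal as a stand-in for speech (an assumption made to keep the example self-contained) and the hypothetical helper `time_reverse`. It also checks one standard signal-processing property that is consistent with the observation above: reversing a real-valued signal leaves its full-utterance magnitude spectrum unchanged and alters only the phase. This is an illustration, not the analysis performed in the paper.

```python
import numpy as np

def time_reverse(waveform):
    """Return the waveform played backward (whole-utterance reversal)."""
    # Copy so the result owns its memory instead of being a reversed view.
    return np.asarray(waveform)[::-1].copy()

# Synthetic stand-in for one second of 16 kHz audio (not real speech).
rng = np.random.default_rng(0)
x = rng.standard_normal(16000)
x_rev = time_reverse(x)

# Magnitude spectra computed over the entire signal, before and after reversal.
mag_fwd = np.abs(np.fft.rfft(x))
mag_rev = np.abs(np.fft.rfft(x_rev))

# Reversal changes phase only, so the magnitudes agree up to float error.
spectra_match = np.allclose(mag_fwd, mag_rev)
```

Applying `time_reverse` twice returns the original signal, so the transform discards no information; it reorders the temporal structure that carries phonemes and syllables while long-term spectral magnitude is untouched.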
