REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion

Ishan D. Biyani, Nirmesh J. Shah, Ashishkumar P. Gudmalwar, Pankaj Wasnik, Rajiv R. Shah

TL;DR

This work tackles speaker-identity disentanglement in diffusion-based voice conversion under limited target data by exploiting full-utterance speech time reversal (STR) as a data-augmentation strategy. The authors fuse conventional speaker embeddings with STR-derived embeddings using a weighted scheme within a diffusion-based VC framework (DDDM-VC), leveraging self-supervised content, pitch, and speaker representations. Perceptual studies show that STR preserves speaker identity despite the loss of intelligibility, and experiments on LibriTTS and VCTK demonstrate notable gains in speaker similarity while maintaining high speech quality, outperforming several state-of-the-art baselines. An ablation supports weighting the two embeddings equally and shows that cross-attention brings further improvement. This approach offers a data-efficient pathway to enhance zero-shot VC and invites exploration of unconventional transforms in speech representation learning.
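The weighted fusion mentioned above can be illustrated with a minimal sketch. The summary does not give the exact fusion formula, so the convex combination below, the function name `fuse_embeddings`, and the parameter `w` are all assumptions made for illustration; `w = 0.5` corresponds to the equal weighting reported in the ablation.

```python
def fuse_embeddings(e_orig, e_str, w=0.5):
    """Blend two speaker embeddings element-wise.

    Assumption: the fusion is a convex combination
        e = w * e_orig + (1 - w) * e_str
    where e_orig comes from the original utterance and e_str from its
    time-reversed copy. The paper's actual scheme may differ.
    """
    if len(e_orig) != len(e_str):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return [w * a + (1.0 - w) * b for a, b in zip(e_orig, e_str)]


# Toy 2-dimensional embeddings, equal weighting.
fused = fuse_embeddings([1.0, 0.0], [0.0, 1.0])
print(fused)
```

Setting `w` to 1.0 recovers the conventional embedding alone, which makes it easy to compare against a baseline without STR.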

Abstract

Speech time reversal refers to the process of reversing the entire speech signal in time, causing it to play backward. Such signals are completely unintelligible since the fundamental structures of phonemes and syllables are destroyed. However, they still retain tonal patterns that enable perceptual speaker identification despite losing linguistic content. In this paper, we propose leveraging speaker representations learned from time reversed speech as an augmentation strategy to enhance speaker representation. Notably, speaker and language disentanglement in voice conversion (VC) is essential to accurately preserve a speaker's unique vocal traits while minimizing interference from linguistic content. The effectiveness of the proposed approach is evaluated in the context of state-of-the-art diffusion-based VC models. Experimental results indicate that the proposed approach significantly improves speaker similarity-related scores while maintaining high speech quality.
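As a rough illustration of the operation the abstract describes, full-utterance time reversal simply flips the order of the waveform samples. The sketch below uses a plain Python list as a stand-in for a waveform; the function name `time_reverse` is hypothetical and not taken from the paper.

```python
def time_reverse(samples):
    """Return a copy of the waveform with its samples in reverse order."""
    return list(reversed(samples))


# Toy waveform: a short ramp standing in for real audio samples.
wave = [0.0, 0.1, 0.4, -0.2, -0.5]
rewound = time_reverse(wave)

# Reversal only reorders samples, so the signal energy is unchanged,
# and reversing twice gives back the original waveform.
energy = sum(x * x for x in wave)
energy_rewound = sum(x * x for x in rewound)
print(rewound, abs(energy - energy_rewound) < 1e-12)
```

Because the operation needs no extra recordings, each training utterance yields one additional reversed view at essentially no cost.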

Paper Structure

This paper contains 11 sections, 2 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Confusion matrices for the perceptual study of speaker identification from time-reversed speech. Here, M1, M2, M3 and F1, F2, F3 represent three different male and female speakers, respectively.
  • Figure 2: Spectrographic visualization of (a) original speech, (b) 20 ms short-time speech reversal, (c) 100 ms short-time speech reversal, and (d) complete speech time reversal.
  • Figure 3: Block diagram of the proposed approach in diffusion-based voice conversion.
  • Figure 4: Ablation analysis for subjective speaker similarity scores w.r.t. the baseline (DiffVC).
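
The short-time reversal shown in Figure 2 (20 ms and 100 ms) differs from complete reversal in that the signal is reversed window by window rather than as a whole. A minimal sketch follows, assuming non-overlapping windows and a final partial window that is reversed as-is; the figure list does not state the windowing details, so these choices and the name `short_time_reverse` are assumptions.

```python
def short_time_reverse(samples, sample_rate, window_ms):
    """Reverse a waveform locally, one non-overlapping window at a time.

    Assumption: windows do not overlap and the last, possibly shorter,
    window is reversed on its own.
    """
    win = int(round(sample_rate * window_ms / 1000.0))
    if win <= 0:
        raise ValueError("window must span at least one sample")
    out = []
    for start in range(0, len(samples), win):
        out.extend(reversed(samples[start:start + win]))
    return out


# Toy signal at 1 kHz so that 2 ms equals 2 samples.
signal = [0, 1, 2, 3, 4, 5]
print(short_time_reverse(signal, sample_rate=1000, window_ms=2))
```

When the window is at least as long as the utterance, the result coincides with complete time reversal, which is the case in panel (d).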