Transfer the linguistic representations from TTS to accent conversion with non-parallel data

Xi Chen, Jiakun Pei, Liumeng Xue, Mingyang Zhang

TL;DR

This work tackles non-parallel accent conversion by learning accent-agnostic linguistic representations: speech representations are aligned with linguistic states derived from a TTS system, which enables a reference-free, non-autoregressive framework. The model comprises a Speech Encoding Block, a Textual Auxiliary Block that is used only during training for alignment, and a Decoder. Training proceeds in three stages: the model first learns TTS, then aligns speech to the linguistic representations, and is finally fine-tuned for accent conversion. Key contributions include the use of TTS-based alignment to learn accent-independent content, a comparison of input features (mel versus Whisper) together with pretraining on native data, and an evaluation with subjective and objective metrics showing improvements in audio quality and intelligibility over baselines. The approach is validated on LibriTTS data and Hindi L2ARCTIC speakers, performs accent conversion without parallel data, and has practical implications for pronunciation learning, dubbing, and robust ASR in multilingual scenarios. In the paper's notation, the framework is written as $M^t=\mathcal{F}(\mathbf{A})$ and uses the representations $H^s$, $H^t$, and $H^l$ to separate content from accent, with the aim of producing high-quality, natural-sounding speech in the target accent.
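
The three-stage schedule and the training-only role of the Textual Auxiliary Block can be sketched as a small piece of bookkeeping code. This is only an illustration under stated assumptions: the block identifiers, the assignment of blocks to each stage, and the use of mean squared error as the alignment distance are hypothetical choices made for the example, since this summary does not specify them.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    goal: str
    # Blocks taking part in this stage. ASSUMPTION: the summary names the
    # three stages but does not say which blocks are involved in each.
    active_blocks: Tuple[str, ...]


# Stage order follows the summary: TTS, then alignment, then fine-tuning.
STAGES = (
    Stage("tts", "learn text-to-speech",
          ("textual_auxiliary", "decoder")),
    Stage("align", "align speech to linguistic representations",
          ("speech_encoding", "textual_auxiliary")),
    Stage("finetune", "fine-tune for accent conversion",
          ("speech_encoding", "decoder")),
)

# The Textual Auxiliary Block is described as training-time only, so the
# conversion path is taken to be Speech Encoding Block -> Decoder.
INFERENCE_BLOCKS = ("speech_encoding", "decoder")


def blocks_used_only_in_training() -> List[str]:
    """Return blocks that appear in some stage but not at inference."""
    used = {block for stage in STAGES for block in stage.active_blocks}
    return sorted(used - set(INFERENCE_BLOCKS))


def alignment_loss(speech_states: Sequence[Sequence[float]],
                   linguistic_states: Sequence[Sequence[float]]) -> float:
    """Mean squared error between two equally shaped sequences of vectors.

    ASSUMPTION: MSE and equal sequence lengths are illustrative only; the
    summary does not state the distance used for alignment.
    """
    if len(speech_states) != len(linguistic_states):
        raise ValueError("sequences must have the same number of frames")
    total, count = 0.0, 0
    for s_vec, l_vec in zip(speech_states, linguistic_states):
        if len(s_vec) != len(l_vec):
            raise ValueError("frame vectors must have the same dimension")
        for a, b in zip(s_vec, l_vec):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("inputs must not be empty")
    return total / count
```

Calling `blocks_used_only_in_training()` on this sketch singles out the auxiliary text branch, which mirrors the reference-free property described above: no text or reference audio is needed once training is finished.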

Abstract

Accent conversion aims to convert the accent of source speech to a target accent while preserving the speaker's identity. This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent of the source speech. Specifically, the proposed system aligns speech representations with linguistic representations obtained from Text-to-Speech (TTS) systems, enabling training of the accent voice conversion model on non-parallel data. Furthermore, we investigate the effectiveness of a pretraining strategy on native data and of different acoustic features within our proposed framework. We conduct a comprehensive evaluation using both subjective and objective metrics to assess the performance of our approach. The evaluation results highlight the benefits of the pretraining strategy and of incorporating richer semantic features, both of which significantly enhance audio quality and intelligibility.

Paper Structure (15 sections, 4 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: Model Architecture
  • Figure 2: Speaker similarity preference test results.