TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation

Chenyang Le, Yao Qian, Dongmei Wang, Long Zhou, Shujie Liu, Xiaofei Wang, Midia Yousefi, Yanmin Qian, Jinyu Li, Sheng Zhao, Michael Zeng

TL;DR

This study introduces TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion yet supports end-to-end inference through joint probability. It also proposes two separate encoders that preserve the speaker's voice characteristics and isochrony from the source speech during translation, making it well suited to scenarios such as video dubbing.

Abstract

There is rising research interest in directly translating speech from one language to another, known as end-to-end speech-to-speech translation. However, most end-to-end models struggle to outperform cascade models, i.e., pipelines that concatenate speech recognition, machine translation, and text-to-speech models. The primary challenges stem from the inherent complexity of direct translation and the scarcity of data. In this study, we introduce TransVIP, a novel model framework that leverages diverse datasets in a cascade fashion yet supports end-to-end inference through joint probability. Furthermore, we propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during translation, making the model well suited to scenarios such as video dubbing. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.

Paper Structure (42 sections, 3 equations, 4 figures, 7 tables, 1 algorithm)

This paper contains 42 sections, 3 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of our speech-to-speech translation framework: 1) Joint encoder-decoder model for translating speech into the target text and coarse-grained speech tokens, $C_0$; 2) Non-autoregressive acoustic model for acoustic details, $C_{0:16}$; 3) Codec model to convert discrete speech tokens back to the waveform. Abbreviations: S/A/I (Semantic/Acoustic/Isochrony Information), $C_0$/$C_{0:16}$ (Codec layer 0/0-15), S/A-Enc (Semantic/Acoustic Encoder), ICM (Isochrony Control Module).
  • Figure 2: Illustration of the training framework of the Joint Enc-Dec Model. During training, the losses from the target speech clip, i.e., a sub-part of the whole target speech that serves as a prompt, are not aggregated when computing the Cross-Entropy (CE) loss. The corresponding codec labels are masked in the implementation. The semantic encoder and the auto-regressive decoder are initialized from a SeamlessM4T X2T model. The semantic encoder is frozen during training. In inference, all target speech inputs are replaced by source speech inputs.
  • Figure 3: Validation loss of TransVIP using different codecs.
  • Figure 4: Instructions for the CMOS naturalness test.
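
The Figure 2 caption says that codec labels belonging to the prompt clip are masked so they do not contribute to the Cross-Entropy loss. The following is a minimal sketch of that general idea in plain Python, not the authors' implementation. The sentinel value `IGNORE`, the helper names `mask_prompt` and `masked_cross_entropy`, and the toy logits are all assumptions made for illustration.

```python
import math

# Hypothetical sentinel marking label positions that belong to the prompt clip.
IGNORE = -100


def mask_prompt(labels, start, end, ignore_index=IGNORE):
    """Return a copy of labels with positions in [start, end) replaced by the sentinel."""
    return [ignore_index if start <= i < end else lab
            for i, lab in enumerate(labels)]


def masked_cross_entropy(logits, labels, ignore_index=IGNORE):
    """Mean cross-entropy over the positions whose label is not the sentinel."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # prompt position: excluded from the loss
        m = max(row)
        # Numerically stable log-sum-exp of the logits at this position.
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[label]
        count += 1
    return total / count if count else 0.0


# Toy example: three positions, two classes; position 1 is treated as the prompt.
logits = [[0.0, 0.0], [10.0, 0.0], [0.0, 0.0]]
labels = [0, 1, 1]
masked_labels = mask_prompt(labels, start=1, end=2)
loss = masked_cross_entropy(logits, masked_labels)
```

In this toy case the prompt position carries a confidently wrong prediction, so masking it lowers the averaged loss. Deep learning frameworks typically offer an equivalent ignore-index option on their cross-entropy losses, which is the more likely route in practice.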