A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation

Anna Min, Chenxu Hu, Yi Ren, Hang Zhao

TL;DR

This work tackles the challenge of preserving paralinguistic information in speech-to-speech translation by introducing an expressive English–Spanish movie dataset and a unit-based direct S2ST framework. It combines HuBERT-based discrete-unit encoding with unit-HiFiGAN synthesis, enabling global style transfer and local prosody/pitch control to maintain emotions without relying on intermediate text. Empirical results show improved emotion, emphasis, intonation, and rhythm preservation over vanilla unit-TTS, while achieving competitive translation quality; the dataset and methodology address data scarcity and facilitate future expressive S2ST research. Overall, the paper advances practical expressive S2ST by demonstrating that joint preservation of paralinguistic cues and translation accuracy is feasible using a carefully curated multimedia dataset and a unit-based synthesis pipeline.

Abstract

Current research in speech-to-speech translation (S2ST) primarily concentrates on translation accuracy and speech naturalness, often overlooking key elements like paralinguistic information, which is essential for conveying emotions and attitudes in communication. To address this, our research introduces a novel, carefully curated multilingual dataset from various movie audio tracks. Each dataset pair is precisely matched for paralinguistic information and duration. We enhance this by integrating multiple prosody transfer techniques, aiming for translations that are accurate, natural-sounding, and rich in paralinguistic details. Our experimental results confirm that our model retains more paralinguistic information from the source speech while maintaining high standards of translation accuracy and naturalness.

Paper Structure

This paper contains 19 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Direct versus cascaded speech-to-speech translation. The green pipeline (top) is the traditional cascaded approach, which requires text as an intermediary. The pipeline below translates discrete units directly, eliminating the need for intermediate text.
  • Figure 2: Length distribution of the utterances; those shown in yellow have a word error rate under 40%. There are 12610 utterances in total. The maximum duration is 244.250 s, the minimum is 0.833 s, and the average is 5.096 s.
  • Figure 3: Our unit-HiFi-GAN-based voice style transfer model extracts local features from audio during both training and inference. During inference, translated units and their corresponding reference waveforms are fed into the model to synthesize the output audio.
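
The Figure 2 caption refers to utterances with a word error rate (WER) under 40%. The listing here does not say how that WER was computed, so the following is only a minimal sketch of one plausible filter. It assumes WER is the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the reference length. The names `word_error_rate` and `filter_by_wer`, the lowercasing step, and the example sentences are all hypothetical illustrations, not the authors' pipeline.

```python
from typing import List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        # Convention chosen for this sketch: an empty reference gives 0.0
        # only when the hypothesis is empty too, otherwise 1.0.
        return 0.0 if not hyp else 1.0

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(
    pairs: List[Tuple[str, str]], threshold: float = 0.40
) -> List[Tuple[str, str]]:
    """Keep (reference, hypothesis) pairs whose WER is below the threshold."""
    return [(ref, hyp) for ref, hyp in pairs if word_error_rate(ref, hyp) < threshold]


if __name__ == "__main__":
    # Made-up sentences used purely to exercise the functions.
    examples = [
        ("where are you going tonight", "where are you going tonight"),
        ("i never said that", "i never sat there"),
    ]
    for ref, hyp in filter_by_wer(examples):
        print(f"kept: {ref!r} (WER={word_error_rate(ref, hyp):.2f})")
```

In practice the threshold and any text normalization (punctuation stripping, number expansion) would need to match whatever the dataset authors actually used; the 0.40 default here only mirrors the figure caption.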