Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining

Rui Zhou, Akinori Ito, Takashi Nose

TL;DR

The paper addresses preserving the source speaker's identity in direct speech-to-speech translation using a non-autoregressive S2UT framework. It introduces self-supervised pre-training for the speaker adapter and the unit-to-mel module, together with several feature-fusion strategies, to address unit-speaker mismatches and improve speaker consistency. Results on the CVSS-T ES-EN and FR-EN tasks show BLEU gains, higher MOS estimates, and improved speaker similarity, while remaining nearly as efficient as end-to-end S2UT baselines and avoiding the overhead of traditional cascades. Cross-attention fusion and embedding-based or pre-training-based speaker representations yield strong gains. The approach achieves competitive translation quality, runs in real time, and generalizes to at least German-English. Overall, the work advances practical, speaker-preserving S2ST through careful pre-training, embedding choices, and fusion design, enabling more natural and faithful multilingual speech translation.

Abstract

Speech-to-Speech Translation (S2ST) refers to the conversion of speech in one language into semantically equivalent speech in another language, facilitating communication between speakers of different languages. Speech-to-Discrete Unit Translation (S2UT), a mainstream approach for end-to-end S2ST, addresses challenges such as error propagation across modules and slow inference speed often encountered in traditional cascade systems. However, as discrete units primarily capture content information, conventional S2UT methods fail to retain speaker-specific characteristics from the source. Our previous work, SC-S2UT, introduced a speaker adapter and a unit-to-mel structure, enabling the preservation of speaker information and non-autoregressive speech generation. Building on this foundation, this study proposes a self-supervised pretraining method to enrich the information extracted by both the speaker adapter and the unit-to-mel structure. Additionally, we investigate different feature fusion strategies to further improve the integration of speaker and content features. Experiments conducted on the CVSS-T dataset for ES-EN and FR-EN tasks demonstrate that our proposed method achieves a BLEU score improvement of 1.14 compared to SC-S2UT, along with significant enhancements in MOS and speaker similarity. Furthermore, our approach achieves translation quality comparable to traditional S2UT, with only a minimal increase of 0.04s per utterance in inference time, while maintaining high speaker similarity. These results validate the effectiveness of the proposed method.
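
The TL;DR names cross-attention as one of the fusion strategies for combining speaker and content features, but this page does not describe how it is implemented. As a rough illustration only, the sketch below shows a generic cross-attention fusion in plain Python: content frames act as queries, speaker frames act as keys and values, and the attended speaker summary is added back to each content frame. The function name `cross_attention_fuse`, the residual addition, the toy dimensions, and the omission of learned projections are all assumptions made for this example, not details taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(content, speaker):
    """Hypothetical fusion helper (not the authors' implementation).

    content: list of T vectors used as queries (e.g. unit embeddings).
    speaker: list of S vectors used as keys and values
             (e.g. speaker-adapter outputs).
    Returns T vectors: each content frame plus its attended speaker summary.
    Learned Q/K/V projections are omitted (identity) to keep the sketch short.
    """
    dim = len(content[0])
    scale = 1.0 / math.sqrt(dim)  # scaled dot-product attention
    fused = []
    for query in content:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in speaker]
        weights = softmax(scores)
        # Attention-weighted sum of the speaker frames.
        context = [
            sum(w * value[j] for w, value in zip(weights, speaker))
            for j in range(dim)
        ]
        # Residual connection: keep content, add speaker information.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Toy inputs: three content frames and two identical speaker frames.
content = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speaker = [[0.5, 0.5], [0.5, 0.5]]
fused = cross_attention_fuse(content, speaker)
```

Because the output keeps one vector per content frame, the fused sequence has the same length as the unit sequence regardless of how many speaker frames are available. A real system would presumably use learned projections and multiple heads.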

Paper Structure

This paper contains 24 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Speaker retention unit-to-mel based speaker consistency S2UT
  • Figure 2: Workflow for pre-training and fine-tuning the SR-U2M module using self-supervised learning
  • Figure 3: Illustration of different feature fusion methods
  • Figure 4: Mel-spectrograms with F0 contours overlaid (in red) for three utterances. Pretrain SC-S2UT shows better pitch continuity and spectral structure, closely matching the ground truth (CVSS-T), whereas the baseline systems S2UT and SC-S2UT exhibit pitch discontinuities and blurred harmonics.