Preset-Voice Matching for Privacy Regulated Speech-to-Speech Translation Systems

Daniel Platnick, Bishoy Abdelnour, Eamon Earl, Rahul Kumar, Zahra Rezaei, Thomas Tsangaris, Faraj Lagum

TL;DR

This work addresses privacy and regulatory concerns in speech-to-speech translation (S2ST) by proposing Preset-Voice Matching (PVM), a framework that avoids cloning an unknown input voice by instead matching it to a preset-voice in the target language whose speaker has given consent. The GEMO-Match algorithm implements PVM with a hierarchical gender-dependent emotion classifier that drives preset-voice selection and downstream TTS, enabling multilingual S2ST with improved naturalness and reduced run-time. The authors introduce the Combined Gender-Dependent Dataset (CGDD) to improve generalization over existing benchmarks, and they demonstrate robustness, multilingual capability, and favorable run-time in English-to-French and English-to-German translation experiments. The results indicate that PVM enhances the safety and commercial viability of S2ST systems while maintaining or improving speech quality, and that the approach can be adapted to regulatory requirements at scale for industry deployment.

Abstract

In recent years, there has been increased demand for speech-to-speech translation (S2ST) systems in industry settings. Although successfully commercialized, cloning-based S2ST systems expose their distributors to liabilities when misused by individuals and can infringe on personality rights when exploited by media organizations. This work proposes a regulated S2ST framework called Preset-Voice Matching (PVM). PVM removes cross-lingual voice cloning from S2ST by first matching the input voice to the similar voice of a previously consenting speaker in the target language. With this separation, PVM avoids cloning the input speaker, ensuring PVM systems comply with regulations and reduce the risk of misuse. Our results demonstrate that PVM can significantly improve both S2ST system run-time in multi-speaker settings and the naturalness of S2ST synthesized speech. To our knowledge, PVM is the first explicitly regulated S2ST framework leveraging similarly-matched preset-voices for dynamic S2ST tasks.

Paper Structure

This paper contains 27 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Comparative processing times of different models. OpenVoice's tone extractor and GEMO-Match are distinguished from their TTS processing times.
  • Figure 2: The OpenVoice tone extractor post-processes every TTS output. GEMO-Match only needs to re-run on the arrival of a different speaker from the one present in the previous input.
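The run-time difference described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: the class `SpeakerChangeMatcher`, the function `toy_match`, and the assumption that each utterance arrives with a known speaker label are all hypothetical, introduced only to show why a step that re-runs on speaker change is invoked less often than one that post-processes every TTS output.

```python
from typing import Callable, List, Optional


class SpeakerChangeMatcher:
    """Re-runs a (hypothetical) matching step only when the speaker changes."""

    def __init__(self, match_fn: Callable[[str], str]) -> None:
        self._match_fn = match_fn  # assumed: maps a speaker label to a preset-voice name
        self._last_speaker: Optional[str] = None
        self._cached_voice: Optional[str] = None
        self.runs = 0  # number of times match_fn was actually invoked

    def voice_for(self, speaker_id: str) -> str:
        # Re-run matching only if there is no cached result yet or the
        # speaker differs from the one in the previous input.
        if self._cached_voice is None or speaker_id != self._last_speaker:
            self._cached_voice = self._match_fn(speaker_id)
            self._last_speaker = speaker_id
            self.runs += 1
        return self._cached_voice


def toy_match(speaker_id: str) -> str:
    # Stand-in for an expensive matching step; purely illustrative.
    return f"preset-for-{speaker_id}"


# Five utterances from two speakers (labels are assumed to be given).
utterance_speakers: List[str] = ["A", "A", "B", "B", "A"]

matcher = SpeakerChangeMatcher(toy_match)
voices = [matcher.voice_for(s) for s in utterance_speakers]

# A step that post-processes every TTS output runs once per utterance.
per_output_runs = len(utterance_speakers)

print(f"per-output runs: {per_output_runs}, speaker-change runs: {matcher.runs}")
```

In this toy sequence the per-output step runs five times while the speaker-change step runs three times; the gap grows as consecutive utterances from the same speaker become longer.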