Improving Direct Persian-English Speech-to-Speech Translation with Discrete Units and Synthetic Parallel Data

Sina Rashidi, Hossein Sameti

TL;DR

Direct speech-to-speech translation (S2ST) needs large amounts of parallel speech, which low-resource pairs such as Persian–English lack. The authors propose a three-component direct S2ST system with a Conformer-based encoder, a decoder that predicts HuBERT-derived discrete units, and a unit vocoder, augmented by a synthetic Persian–English parallel corpus created with GPT-4o translations and zero-shot VoiceCraft TTS. On CVSS Fa–En, the model gains up to 4.6 ASR BLEU over strong direct baselines when trained with the synthetic data, with encoder pretraining and discrete-unit modeling also contributing significantly. The results indicate that self-supervised pretraining, discrete-unit representations, and scalable synthetic data can make direct S2ST viable for low-resource pairs, with practical relevance to audio dubbing.

Abstract

Direct speech-to-speech translation (S2ST), in which all components are trained jointly, is an attractive alternative to cascaded systems because it offers a simpler pipeline and lower inference latency. However, direct S2ST models require large amounts of parallel speech data in the source and target languages, which are rarely available for low-resource languages such as Persian. This paper presents a direct S2ST system for translating Persian speech into English speech, as well as a pipeline for synthetic parallel Persian-English speech generation. The model comprises three components: (1) a Conformer-based encoder, initialized from self-supervised pre-training, maps source speech to high-level acoustic representations; (2) a causal Transformer decoder with relative-position multi-head attention translates these representations into discrete target speech units; (3) a unit-based neural vocoder generates waveforms from the predicted discrete units. To mitigate the data scarcity problem, we construct a new Persian-English parallel speech corpus by translating Persian speech transcriptions into English using a large language model and then synthesizing the corresponding English speech with a state-of-the-art zero-shot text-to-speech system. The resulting corpus increases the amount of available parallel speech by roughly a factor of six. On the Persian-English portion of the CVSS corpus, the proposed model achieves an improvement of 4.6 ASR BLEU over direct baselines when trained with the synthetic data. These results indicate that combining self-supervised pre-training, discrete speech units, and synthetic parallel data is effective for improving direct S2ST in low-resource language pairs such as Persian-English.
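The synthetic-corpus construction described in the abstract (translate each Persian transcription into English, then synthesize English speech from the translation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `ParallelPair`, `build_synthetic_corpus`, and the `translate` and `synthesize` callables are hypothetical names standing in for the LLM and zero-shot TTS steps, and skipping empty translations is an assumed filtering choice.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class ParallelPair:
    """One synthetic example: Persian source speech paired with English target speech."""
    source_audio: str  # path to the original Persian recording
    source_text: str   # Persian transcription of that recording
    target_text: str   # English translation of the transcription
    target_audio: str  # path to the synthesized English speech


def build_synthetic_corpus(
    items: Iterable[Tuple[str, str]],
    translate: Callable[[str], str],
    synthesize: Callable[[str], str],
) -> List[ParallelPair]:
    """Turn (persian_audio_path, persian_transcript) items into parallel pairs.

    `translate` and `synthesize` are hypothetical stand-ins for the LLM
    translation step and the zero-shot TTS step; they are not real APIs.
    """
    corpus = []
    for source_audio, source_text in items:
        target_text = translate(source_text).strip()
        if not target_text:
            # Assumed filter: drop items whose translation came back empty.
            continue
        target_audio = synthesize(target_text)
        corpus.append(ParallelPair(source_audio, source_text, target_text, target_audio))
    return corpus
```

Passing the two steps in as callables keeps the sketch independent of any particular translation model or TTS system; real systems would likely add quality filtering that is not described here.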

Paper Structure

This paper contains 15 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Demonstration of a direct speech-to-speech translation pipeline
  • Figure 2: Architecture of the proposed model
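
As a rough reading aid for the architecture named in Figure 2, the three components listed in the abstract (encoder, discrete-unit decoder, unit vocoder) compose into a single inference path. The sketch below only shows that data flow; `DirectS2ST` and the toy lambdas are hypothetical placeholders, not the actual Conformer, Transformer decoder, or vocoder.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Waveform = Sequence[float]    # raw audio samples
Features = List[List[float]]  # one feature vector per encoder frame
Units = List[int]             # discrete target speech unit ids


@dataclass
class DirectS2ST:
    """Hypothetical container wiring the three stages together."""
    encoder: Callable[[Waveform], Features]       # source speech -> acoustic representations
    unit_decoder: Callable[[Features], Units]     # representations -> discrete target units
    vocoder: Callable[[Units], Waveform]          # discrete units -> target waveform

    def translate(self, source: Waveform) -> Waveform:
        features = self.encoder(source)
        units = self.unit_decoder(features)
        return self.vocoder(units)


# Toy stand-ins so the flow can be run end to end; they carry no linguistic meaning.
toy = DirectS2ST(
    encoder=lambda wav: [[x] for x in wav],
    unit_decoder=lambda feats: [int(f[0] > 0) for f in feats],
    vocoder=lambda units: [float(u) for u in units],
)
print(toy.translate([-1.0, 2.0, 0.5]))  # [0.0, 1.0, 1.0]
```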