Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation

Jarod Duret, Yannick Estève, Titouan Parcollet

TL;DR

The paper addresses how to select discrete speech units for textless speech-to-speech translation (S2ST) and argues that unit quality cannot be judged by resynthesis alone. It evaluates several self-supervised encoders (Wav2Vec 2.0, HuBERT, SAMU-XLSR) on four downstream tasks (emotion recognition, automatic speech recognition, automatic speaker verification, and speech synthesis) and in a speech-to-unit translation setup trained on CVSS data, analyzing the effect of encoder layer and cluster count. The findings show a mismatch between the features that perform best on the downstream tasks and those that maximize translation quality, which points to the need for task-aware unit selection and possibly for combining encoders. The work informs the design of robust textless S2ST systems and suggests directions such as integrating semantically aligned representations and multi-encoder configurations to handle varied linguistic and acoustic conditions.

Abstract

Recent advancements in textless speech-to-speech translation systems have been driven by the adoption of self-supervised learning techniques. Although most state-of-the-art systems adopt a similar architecture to transform source language speech into sequences of discrete representations in the target language, the criteria for selecting these target speech units remain an open question. This work explores the selection process through a study of downstream tasks such as automatic speech recognition, speech synthesis, speaker recognition, and emotion recognition. Interestingly, our findings reveal a discrepancy in the optimization of discrete speech units: units that perform well in resynthesis are not necessarily those that improve translation quality. This discrepancy underscores the complexity of target feature selection and its impact on the overall performance of speech-to-speech translation systems.

Paper Structure

This paper contains 13 sections, 1 figure, and 5 tables.

Figures (1)

  • Figure 1: Effect of the number of clusters on BLEU scores. We adopt the baseline configuration, HuBERT Base at layer 6, with k = 128, 512, and 1024 clusters.
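
To make the role of the cluster count k concrete, the following is a minimal sketch of how continuous encoder frames can be turned into discrete units with k-means. It is an illustration under stated assumptions, not the authors' pipeline: the frames are random toy vectors standing in for HuBERT layer-6 features, the values of k are scaled down from 128, 512, and 1024 so the example runs instantly, and the helper names (nearest_centroid, kmeans, to_units) are hypothetical.

```python
import random


def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean)."""
    best, best_dist = 0, float("inf")
    for idx, centroid in enumerate(centroids):
        dist = sum((a - b) ** 2 for a, b in zip(frame, centroid))
        if dist < best_dist:
            best, best_dist = idx, dist
    return best


def kmeans(frames, k, iters=10, seed=0):
    """Plain Lloyd's k-means; centroids are initialised from k random frames."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        # Assignment step: group frames by their nearest centroid.
        buckets = [[] for _ in range(k)]
        for f in frames:
            buckets[nearest_centroid(f, centroids)].append(f)
        # Update step: move each non-empty centroid to the mean of its bucket.
        for i, bucket in enumerate(buckets):
            if bucket:
                dim = len(bucket[0])
                centroids[i] = [
                    sum(f[d] for f in bucket) / len(bucket) for d in range(dim)
                ]
    return centroids


def to_units(frames, centroids):
    """Map each frame to a discrete unit ID (the index of its nearest centroid)."""
    return [nearest_centroid(f, centroids) for f in frames]


# Toy stand-in for encoder features: 200 frames of dimension 4 (assumption).
_rng = random.Random(1)
frames = [[_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]

# Small k values stand in for the paper's 128, 512, and 1024 (assumption).
for k in (2, 8, 32):
    units = to_units(frames, kmeans(frames, k))
    print(f"k={k}: {len(set(units))} distinct units over {len(units)} frames")
```

A larger k gives a finer-grained unit vocabulary for the same frames, which is the variable Figure 1 varies while holding the encoder and layer fixed.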