TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data

Seung-Bin Kim, Sang-Hoon Lee, Seong-Whan Lee

TL;DR

TranSentence is a speech-to-speech translation model trained without language-parallel speech data. At inference, it generates target-language speech from a language-agnostic embedding of the source-language speech.

Abstract

Although speech-to-speech translation has advanced significantly, conventional models still require language-parallel speech data between the source and target languages for training. In this paper, we introduce TranSentence, a novel speech-to-speech translation model that requires no language-parallel speech data. To achieve this, we first adopt a language-agnostic sentence-level speech encoding that captures the semantic information of speech, irrespective of language. We then train our model to generate speech from the embedding produced by a language-agnostic sentence-level speech encoder that is pre-trained on multiple languages. With this method, although the model is trained exclusively on monolingual data in the target language, it can generate target-language speech at inference from the language-agnostic embedding of source-language speech. Furthermore, we extend TranSentence to multilingual speech-to-speech translation. The experimental results demonstrate that TranSentence outperforms the models it is compared against.
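
The training and inference asymmetry described in the abstract can be illustrated with a toy sketch. Nothing below is the authors' implementation: `toy_encode`, `toy_decode`, the noise scale, and the linear least-squares "decoder" are all hypothetical stand-ins, and an integer label stands in for generated target speech. The only property taken from the paper is the assumption that utterances with the same meaning receive nearly the same embedding regardless of language.

```python
import numpy as np

rng = np.random.default_rng(0)
N_MEANINGS, DIM = 5, 16

# Hypothetical stand-in for the pre-trained encoder: one shared vector per
# meaning, plus a small language-dependent perturbation.
meaning_vectors = rng.normal(size=(N_MEANINGS, DIM))
LANG_SEED = {"src": 1, "tgt": 2}


def toy_encode(meaning_id: int, lang: str) -> np.ndarray:
    """Return a toy language-agnostic embedding for one utterance."""
    noise_rng = np.random.default_rng(100 * meaning_id + LANG_SEED[lang])
    return meaning_vectors[meaning_id] + 0.01 * noise_rng.normal(size=DIM)


# Training sees target-language data only: the embedding of each target
# utterance paired with that same utterance (here, a one-hot label).
X_train = np.stack([toy_encode(m, "tgt") for m in range(N_MEANINGS)])
Y_train = np.eye(N_MEANINGS)

# Toy "decoder": a linear map fitted by least squares (an assumption made
# for brevity, not the paper's architecture).
W, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)


def toy_decode(embedding: np.ndarray) -> int:
    """Map an embedding to the index of the target utterance it selects."""
    return int(np.argmax(embedding @ W))


# Inference: embeddings come from source-language utterances, which the
# decoder never saw during training.
predictions = [toy_decode(toy_encode(m, "src")) for m in range(N_MEANINGS)]
print(predictions)
```

Because the source and target embeddings of one meaning are close in this toy setup, a decoder fitted on target-language pairs alone also handles source-language inputs. That is the intuition behind training without language-parallel data, under the stated assumption about the encoder.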

Paper Structure

This paper contains 20 sections, 2 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: Overview of TranSentence. The pre-trained language-agnostic sentence-level speech encoder generates a speech embedding, and TranSentence is trained to reconstruct target (tgt) language speech from it. During inference, TranSentence generates target-language speech from the embedding of the source (src) language speech.
  • Figure 2: t-SNE visualization of language-agnostic sentence-level speech embeddings. Each symbol represents the language of an utterance, and utterances with the same meaning share the same color.