Textless Speech-to-Speech Translation With Limited Parallel Data

Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi

TL;DR

PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data. It pretrains a model on large-scale monolingual speech data, finetunes it with a small amount of parallel speech data, and then trains with an unsupervised backtranslation objective.

Abstract

Existing speech-to-speech translation (S2ST) models fall into two camps: they either leverage text as an intermediate step or require hundreds of hours of parallel speech data. Both approaches are incompatible with textless languages or language pairs with limited parallel data. We present PFB, a framework for training textless S2ST models that require just dozens of hours of parallel speech data. We first pretrain a model on large-scale monolingual speech data, finetune it with a small amount of parallel speech data (20-60 hours), and lastly train with an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech. Evaluated using the ASR-BLEU metric, our models achieve reasonable performance on all three domains, with some being within 1-2 points of our higher-resourced topline.
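
The three-stage recipe described in the abstract (and detailed in the Figure 2 caption) can be written down as a simple schedule. The following is a minimal sketch, not the authors' implementation: the `Stage` class, the `run_schedule` helper, and the string labels for corpora and losses are all hypothetical names chosen for illustration. Only the stage order and the pairing of data with objectives come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    """One training stage: a name, the corpora it reads, the losses it optimizes."""
    name: str
    data: Tuple[str, ...]
    losses: Tuple[str, ...]


# Stage order and data/loss pairing follow the abstract and the Figure 2 caption.
# The label strings are hypothetical and exist only for this sketch.
PFB_STAGES: Tuple[Stage, ...] = (
    Stage("pretrain", ("monolingual",), ("denoising",)),
    Stage("finetune", ("parallel",), ("supervised",)),
    Stage(
        "backtranslate",
        ("monolingual", "parallel"),
        ("round_trip_consistency", "supervised_replay"),
    ),
)


def run_schedule(
    stages: Tuple[Stage, ...],
    step_fn: Callable[[Stage, str], None],
) -> List[Tuple[str, str]]:
    """Call step_fn once per (stage, loss) pair in order; return the visited pairs."""
    visited: List[Tuple[str, str]] = []
    for stage in stages:
        for loss in stage.losses:
            step_fn(stage, loss)  # a real step_fn would run an optimizer update here
            visited.append((stage.name, loss))
    return visited


if __name__ == "__main__":
    # A no-op step function stands in for real training.
    for name, loss in run_schedule(PFB_STAGES, lambda stage, loss: None):
        print(f"{name}: {loss}")
```

In a real system, `step_fn` would draw a batch from the listed corpora and apply the named objective. Here it is left as a placeholder so that the ordering of the stages is the only thing the sketch asserts.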

Paper Structure

This paper contains 42 sections, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: Overview of speech-to-speech translation systems. We compare our formulation to two relevant lines of work. We present the first textless speech-to-speech system that does not require a large parallel training dataset.
  • Figure 2: Training a unit-based encoder-decoder model for S2ST. The first Pretrain step trains on large-scale monolingual speech data using a denoising pretraining loss. The second Finetune step trains on low-resource parallel speech translation data using a supervised finetuning loss. The third Backtranslate step trains using the round-trip consistency loss (on monolingual data) and supervised finetuning replay (on parallel data).
  • Figure 3: PNMI vs. layer index, comparing different clustering settings for English and German. Higher is better.
  • Figure 4: PNMI with HuBERT and Indic wav2vec2.0 evaluated on Shrutilipi, computed for different layer indices, for Marathi. Higher is better.
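
Figures 3 and 4 report PNMI per layer. As a hedged illustration, the sketch below assumes the usual definition from the HuBERT literature, phone-normalized mutual information: the mutual information between frame-level phone labels and cluster unit IDs, divided by the entropy of the phone labels. The function name `pnmi` and the assumption that phones and units arrive as two frame-aligned sequences are choices made for this example, not details taken from the page.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def pnmi(phones: Sequence[Hashable], units: Sequence[Hashable]) -> float:
    """Return I(phone; unit) / H(phone) for two frame-aligned label sequences."""
    if len(phones) != len(units):
        raise ValueError("phones and units must have the same number of frames")
    n = len(phones)
    if n == 0:
        raise ValueError("at least one frame is required")

    joint = Counter(zip(phones, units))   # co-occurrence counts
    phone_counts = Counter(phones)        # marginal counts over phones
    unit_counts = Counter(units)          # marginal counts over units

    # Entropy of the phone distribution, H(phone).
    h_phone = -sum((c / n) * math.log(c / n) for c in phone_counts.values())
    if h_phone == 0.0:
        raise ValueError("phone entropy is zero, so PNMI is undefined")

    # Mutual information I(phone; unit) = sum p(y,z) log(p(y,z) / (p(y) p(z))).
    mi = sum(
        (c / n) * math.log((c * n) / (phone_counts[y] * unit_counts[z]))
        for (y, z), c in joint.items()
    )
    return mi / h_phone


if __name__ == "__main__":
    # Units that track phones one-to-one versus units unrelated to phones.
    print(pnmi(["a", "b", "a", "b"], [0, 1, 0, 1]))
    print(pnmi(["a", "a", "b", "b"], [0, 1, 0, 1]))
```

Under this definition, a score near 1 means the units almost fully determine the phone, and a score near 0 means they carry little phonetic information, which matches the "higher is better" reading in the captions.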