NAIST Simultaneous Speech Translation System for IWSLT 2024

Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Tomoya Yanagita, Kosuke Doi, Mana Makinae, Haotian Tan, Makoto Sakai, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura

TL;DR

This work addresses simultaneous English→{German, Japanese, Chinese} speech-to-text translation and English→Japanese speech-to-speech (S2S) translation. It uses a multilingual end-to-end speech translation (ST) model built from HuBERT and mBART50 and evaluates two decoding policies, Local Agreement (LA) and AlignAtt, under both non-computation-aware and computation-aware latency regimes. For S2S, the ST model is cascaded with an incremental TTS module that has a Transformer-based phoneme/prosody estimator and a Parallel WaveGAN vocoder; upgrading the TTS component improves S2S ASR_BLEU and naturalness. The authors compare the policies empirically across data configurations and latency settings and find that LA delivers higher quality in many practical conditions, while AlignAtt can outperform LA in low-latency, computation-aware scenarios. Overall, the system improves on the NAIST 2023 submission and offers insights into policy choice, data augmentation via bilingual prefix alignment, and incremental speech synthesis for real-time translation.

Abstract

This paper describes NAIST's submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-{German, Japanese, Chinese} speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained language models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.
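
The two decoding policies named in the abstract can be illustrated with a minimal sketch. It follows the commonly described form of each policy and is not the authors' implementation: Local Agreement commits the longest common prefix of the hypotheses from two consecutive speech chunks, and AlignAtt stops emitting once a token's most-attended source frame falls within the last `f` frames received so far. The function names, the token lists, and the toy attention matrix are all hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def longest_common_prefix(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return the longest shared prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return list(a[:n])


def local_agreement(hypotheses: Sequence[Sequence[str]]) -> List[List[str]]:
    """Illustrative Local Agreement over consecutive chunk hypotheses.

    hypotheses[i] is the full hypothesis decoded after speech chunk i.
    Returns the tokens newly committed after each chunk.
    """
    committed: List[str] = []
    previous = None
    emitted_per_chunk: List[List[str]] = []
    for hyp in hypotheses:
        if previous is None:
            new_tokens: List[str] = []  # nothing to agree with yet
        else:
            stable = longest_common_prefix(previous, hyp)
            new_tokens = stable[len(committed):]  # only tokens not yet emitted
            committed.extend(new_tokens)
        emitted_per_chunk.append(new_tokens)
        previous = hyp
    return emitted_per_chunk


def alignatt_emit(tokens: Sequence[str],
                  attention: Sequence[Sequence[float]],
                  f: int) -> List[str]:
    """Illustrative AlignAtt decision for one decoding step.

    attention[i][j] is the cross-attention weight of candidate token i on
    source frame j. Emission stops at the first token whose most-attended
    frame lies within the last f frames.
    """
    num_frames = len(attention[0]) if attention else 0
    emitted: List[str] = []
    for token, weights in zip(tokens, attention):
        most_attended = max(range(num_frames), key=lambda j: weights[j])
        if most_attended >= num_frames - f:
            break  # token depends on audio that may still be incomplete
        emitted.append(token)
    return emitted


# Hypothetical hypotheses after four successive chunks.
la_result = local_agreement([
    ["ich"],
    ["ich", "bin"],
    ["ich", "bin", "hier"],
    ["ich", "bin", "da", "."],
])

# Hypothetical attention of three candidate tokens over five source frames.
aa_result = alignatt_emit(
    ["a", "b", "c"],
    [[0.70, 0.10, 0.10, 0.05, 0.05],
     [0.10, 0.60, 0.10, 0.10, 0.10],
     [0.00, 0.10, 0.10, 0.20, 0.60]],
    f=2,
)
```

In this sketch the chunk size controls how often LA re-decodes, and `f` controls how close to the end of the received audio AlignAtt is willing to attend, so both act as latency knobs.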

Paper Structure

This paper contains 29 sections, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Results of the Local Agreement (LA) and AlignAtt policies with AL on the speech-to-text systems. The circled dot in the LA graph indicates our submitted system. The circled dot in the AlignAtt graph indicates the best model satisfying the task requirement of the IWSLT 2024 Shared Task.
  • Figure 9: Results of the Local Agreement (LA) and AlignAtt policies with ATD, Start_Offset, and End_Offset on the speech-to-speech systems. The circled dot in the LA graph indicates our submitted system. The circled dot in the AlignAtt graph indicates the best model satisfying the task requirement of the IWSLT 2024 Shared Task.
  • Figure 17: LA (chunk size = 950 ms)
  • Figure 18: AlignAtt (chunk size = 800 ms, $f=6$)