NAIST Simultaneous Speech Translation System for IWSLT 2024
Yuka Ko, Ryo Fukuda, Yuta Nishikawa, Yasumasa Kano, Tomoya Yanagita, Kosuke Doi, Mana Makinae, Haotian Tan, Makoto Sakai, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura
TL;DR
This work tackles simultaneous English→{German, Japanese, Chinese} speech-to-text translation and English→Japanese speech-to-speech (S2S) translation with a multilingual end-to-end speech translation (ST) model built from HuBERT and mBART50. It evaluates two decoding policies, Local Agreement (LA) and AlignAtt, under non-computation-aware and computation-aware latency regimes. For S2S, the ST model is cascaded with an incremental TTS module that uses a Transformer-based phoneme/prosody estimator and a Parallel WaveGAN vocoder; upgrading the TTS component yields notable gains in ASR_BLEU and naturalness. The authors compare the policies empirically across data configurations and latency settings, showing that LA delivers higher quality in many practical conditions, while AlignAtt can outperform LA in low-latency, computation-aware scenarios. Overall, the system improves on the NAIST 2023 submission and offers insights into policy choice, data augmentation via bilingual prefix alignment, and incremental speech synthesis for real-time translation applications.
Abstract
This paper describes NAIST's submission to the simultaneous track of the IWSLT 2024 Evaluation Campaign: English-to-{German, Japanese, Chinese} speech-to-text translation and English-to-Japanese speech-to-speech translation. We develop a multilingual end-to-end speech-to-text translation model combining two pre-trained models, HuBERT and mBART. We trained this model with two decoding policies, Local Agreement (LA) and AlignAtt. The submitted models employ the LA policy because it outperformed the AlignAtt policy in our previous models. Our speech-to-speech translation method is a cascade of the above speech-to-text model and an incremental text-to-speech (TTS) module that incorporates a phoneme estimation model, a parallel acoustic model, and a Parallel WaveGAN vocoder. We improved our incremental TTS by applying the Transformer architecture with the AlignAtt policy for the estimation model. The results show that our upgraded TTS module contributed to improving the system performance.
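
To make the Local Agreement policy named above more concrete, the following is a minimal sketch of its commonly described form (agreement between two consecutive chunk hypotheses): after each new speech chunk the model re-decodes, and only the tokens on which the latest two hypotheses agree are emitted. The function names, the token-list representation, and the `committed` counter are illustrative assumptions, not the submitted system's code.

```python
from typing import List, Sequence


def longest_common_prefix(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return the longest shared leading run of tokens of a and b."""
    prefix: List[str] = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


def local_agreement(hypotheses: Sequence[Sequence[str]], committed: int = 0) -> List[str]:
    """Return the tokens that can newly be emitted (hypothetical helper).

    hypotheses: one full hypothesis per speech chunk processed so far.
    committed:  number of tokens already emitted in earlier steps.
    """
    if len(hypotheses) < 2:
        # With a single hypothesis there is nothing to agree with yet.
        return []
    stable = longest_common_prefix(hypotheses[-2], hypotheses[-1])
    # Emit only the part of the agreed prefix that has not been emitted before.
    return stable[committed:]


# Usage: two consecutive chunk hypotheses agree on their first two tokens.
hyps = [["Ich", "bin", "eine"], ["Ich", "bin", "ein", "Student"]]
new_tokens = local_agreement(hyps, committed=0)
```

In this sketch a larger chunk size means fewer re-decoding steps and later agreement, which is one way such a policy trades latency for quality.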
