CTC-based Non-autoregressive Textless Speech-to-Speech Translation
Qingkai Fang, Zhengrui Ma, Yan Zhou, Min Zhang, Yang Feng
TL;DR
This work addresses the slow decoding of direct speech-to-speech translation (S2ST) with a CTC-based non-autoregressive (NAR) textless S2ST model. The approach extends S2UT with a CTC-based NAR unit decoder and combines encoder pretraining, knowledge distillation, glancing training, and non-monotonic latent alignments. The resulting model achieves translation quality comparable to autoregressive S2UT while decoding up to $26.81\times$ faster, showing that a stronger NAR formulation can close the quality gap. The method enables fast, textless S2ST suitable for unwritten languages and demonstrates advantages over prior NAR S2ST methods.
Abstract
Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often suffers from slow decoding because speech sequences are long. Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet their translation quality typically lags significantly behind that of autoregressive (AR) models. In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation. Experimental results demonstrate that by combining pretraining, knowledge distillation, and advanced NAR training techniques such as glancing training and non-monotonic latent alignments, CTC-based NAR models achieve translation quality comparable to the AR model, while preserving up to $26.81\times$ decoding speedup.
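To illustrate why a CTC-based decoder can run in parallel, the following is a minimal sketch of generic CTC greedy decoding, not the authors' implementation. It assumes the decoder emits one score vector over discrete units per frame, and that index 0 is the CTC blank symbol; the function name `ctc_greedy_decode` and the toy scores are hypothetical. Each frame's argmax is taken independently of the others, then repeated labels are merged and blanks are dropped.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumption: index 0 is the CTC blank symbol


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]], blank: int = BLANK
) -> List[int]:
    """Collapse per-frame predictions into a unit sequence (generic CTC rule)."""
    # Step 1: pick the best label for every frame. No frame depends on
    # another frame's output, which is what makes the decoding non-autoregressive.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Step 2: merge consecutive repeats, then remove blanks.
    return [unit for unit, _ in groupby(best_path) if unit != blank]


# Toy example with three labels (0 = blank, 1 and 2 = hypothetical units).
toy_scores = [
    [0.1, 0.8, 0.1],    # argmax 1
    [0.2, 0.7, 0.1],    # argmax 1 (repeat, merged)
    [0.9, 0.05, 0.05],  # argmax 0 (blank, separates the two 1s)
    [0.1, 0.6, 0.3],    # argmax 1
    [0.1, 0.2, 0.7],    # argmax 2
    [0.1, 0.1, 0.8],    # argmax 2 (repeat, merged)
]
print(ctc_greedy_decode(toy_scores))  # [1, 1, 2]
```

The blank between the second and fourth frames is what allows the same unit to appear twice in a row in the output; without it, the two runs of label 1 would be merged into one.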
