CTC-based Non-autoregressive Textless Speech-to-Speech Translation

Qingkai Fang; Zhengrui Ma; Yan Zhou; Min Zhang; Yang Feng

TL;DR

Addresses the slow decoding of direct speech-to-speech translation (S2ST) with a CTC-based non-autoregressive (NAR) textless model. The approach extends speech-to-unit translation (S2UT) with a CTC-based NAR unit decoder and combines encoder pretraining, knowledge distillation, glancing training, and non-monotonic latent alignments. The resulting model reaches translation quality comparable to autoregressive S2UT while decoding up to $26.81\times$ faster, which shows that a stronger NAR formulation is effective for S2ST. Because the model is textless, it suits unwritten languages, and it outperforms prior NAR S2ST methods.

Abstract

Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding due to the considerable length of speech sequences. Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet the translation quality typically lags behind autoregressive (AR) models significantly. In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation. Experimental results demonstrate that by combining pretraining, knowledge distillation, and advanced NAR training techniques such as glancing training and non-monotonic latent alignments, CTC-based NAR models achieve translation quality comparable to the AR model, while preserving up to $26.81\times$ decoding speedup.
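
To illustrate why a CTC-based decoder can emit all output positions in a single parallel pass, the following is a minimal sketch of standard greedy CTC decoding: take the best label per frame, merge consecutive repeats, then drop blanks. It is not taken from the paper. The function name `ctc_greedy_decode`, the blank index `0`, and the toy score matrix are assumptions chosen for illustration, and the paper's actual decoding procedure may differ.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_decode(frame_scores, blank=BLANK):
    """Greedy CTC decoding over per-frame scores.

    frame_scores: list of lists, one score vector per decoder position,
    each covering the (hypothetical) discrete unit vocabulary plus blank.
    """
    # Each position's argmax depends only on that position's scores,
    # so in a real model this step runs in parallel over all positions.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # CTC collapse: merge consecutive repeated labels, then remove blanks.
    merged = [label for label, _ in groupby(best_path)]
    return [label for label in merged if label != blank]


# Toy example with a 3-symbol vocabulary: 0 = blank, 1 and 2 = units.
scores = [
    [0.10, 0.80, 0.10],  # best label: 1
    [0.20, 0.70, 0.10],  # best label: 1 (repeat, merged with previous)
    [0.90, 0.05, 0.05],  # best label: 0 (blank, separates the two 1s)
    [0.10, 0.60, 0.30],  # best label: 1
    [0.10, 0.20, 0.70],  # best label: 2
    [0.10, 0.20, 0.70],  # best label: 2 (repeat, merged with previous)
]
units = ctc_greedy_decode(scores)
print(units)  # [1, 1, 2]
```

The blank between the second and fourth frames is what lets the same unit appear twice in a row in the output. An autoregressive decoder would instead need one sequential step per output unit, which is the cost the NAR formulation avoids.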

Paper Structure (23 sections, 2 equations, 1 figure, 4 tables)

Introduction
Background
Speech-to-unit Translation
Connectionist Temporal Classification
Method
Model Architecture
Speech Encoder
Unit Decoder
Training
Encoder Pretraining
Knowledge Distillation
Glancing Training
Non-monotonic Latent Alignments
Experiments
Experimental Setups
...and 8 more sections

Figures (1)

Figure 1: Model architecture of CTC-S2UT.
