MTP-S2UT: Enhancing Speech-to-Speech Translation Quality with Multi-token Prediction

Jianjin Wang, Runsong Zhao, Xiaoqian Liu, Yuan Ge, Ziqiang Xu, Tong Xiao, Shengxiang Gao, Zhengtao Yu, Jingbo Zhu

TL;DR

The paper tackles limited semantic density in direct speech-to-speech translation by introducing multi-token prediction (MTP) and applying an MTP-S2UT loss at an intermediate CTC layer to fuse speech and text earlier. MTP predicts the subsequent $N$ tokens per position (with $N=7$ in experiments) and is implemented in several variants; the MTP-S2UT loss specifically targets the CTC layer to enrich hidden representations. Across French→English and Spanish→English on the CVSS-C benchmark, MTP-S2UT yields consistent improvements in ASR-BLEU across tokenizers and decoding methods, with the strongest gains when applying MTP early. Analyses show that MTP causes a forward shift in CTC alignments and reduces prediction uncertainty for speech tokens, illustrating more efficient semantic planning and cross-modal fusion. Overall, the work demonstrates that early intermediate-layer MTP enrichment substantially boosts direct S2UT performance and provides a path for further advancements in cross-modal sequence modeling.

Abstract

Current direct speech-to-speech translation methods predominantly employ speech tokens as intermediate representations. However, a single speech token is not dense in semantics, so we generally need multiple tokens to express a complete semantic unit. To address this limitation, we introduce multi-token prediction (MTP) loss into speech-to-unit translation (S2UT) models, enabling models to predict multiple subsequent tokens at each position, thereby capturing more complete semantics and enhancing information density per position. Initial MTP implementations apply the loss at the final layer, which improves the output representation but initiates information enrichment too late. We hypothesize that advancing the information enrichment process to intermediate layers can achieve earlier and more effective enhancement of the hidden representations. Consequently, we propose the MTP-S2UT loss, which applies the MTP loss to the hidden representation where the CTC loss is computed. Experiments demonstrate that all MTP loss variants consistently improve the quality of S2UT translation, with MTP-S2UT achieving the best performance.
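
As a rough illustration of the multi-token objective described above, the sketch below averages the cross-entropy of $N$ prediction heads, where head $k$ at position $t$ is scored against the token $k$ steps ahead. The function names, the one-head-per-offset layout, and the uniform averaging are assumptions made for illustration only. It does not reproduce the paper's four variants or the placement of the MTP-S2UT loss at the CTC layer.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mtp_loss(head_logits, targets, n_future):
    """Average negative log-likelihood over all (position, offset) pairs.

    head_logits[k - 1][t] holds the logits of a hypothetical head k at
    position t; that head is scored against targets[t + k]. Pairs whose
    target would fall past the end of the sequence are skipped.
    With n_future == 1 this reduces to ordinary next-token prediction.
    """
    total, count = 0.0, 0
    for k in range(1, n_future + 1):
        for t in range(len(targets) - k):
            log_probs = log_softmax(head_logits[k - 1][t])
            total -= log_probs[targets[t + k]]
            count += 1
    return total / count if count else 0.0


# Toy check: uniform logits over a 4-token vocabulary give a loss of ln 4.
targets = [0, 1, 2, 3, 0]
uniform = [[[0.0] * 4 for _ in targets] for _ in range(2)]
print(mtp_loss(uniform, targets, n_future=2))  # approximately 1.386
```

Setting `n_future=7` would match the $N=7$ used in the experiments, although how the paper weights or shares parameters across the offsets is not stated in this summary.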

Paper Structure

This paper contains 16 sections, 9 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of S2UT model and our implementation of 4 MTP loss variants on the S2UT model.
  • Figure 2: Example of CTC output sequences decoded from the intermediate hidden states $H^m_{dec}$. Compared to NTP loss, the model trained with MTP loss produces text tokens $y_1$ that exhibit an overall forward shift. After each text token's first occurrence, subsequent tokens can access its semantic information, thus we use the first occurrence position to represent the earliest availability of its semantic information. For instance, the first occurrence positions of $y_1$ and $y_2$ from the model trained with MTP loss are 12.5% and 62.5%, respectively. Complete statistics are provided in Table \ref{cvssv_ctc}.
  • Figure 3: Entropy distribution of 1.2M speech token predictions in S2UT models trained with MTP loss. Speech tokens are from the CVSS-C Fr$\to$En test set using the $\mathcal{S}^3$ tokenizer. All frequencies are presented relative to the baseline (with NTP loss) for enhanced visualization clarity.
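
The two analysis quantities named in the Figure 2 and Figure 3 captions can be sketched as follows: the relative first-occurrence position of each text token in a frame-level CTC output, and the entropy of a single prediction distribution. The function names, the `<blank>` symbol, the 0-based normalization by sequence length, and the use of nats are all assumptions. The 8-frame toy sequence is made up so that it reproduces the 12.5% and 62.5% values quoted in the caption, and the paper's own convention may differ.

```python
import math


def first_occurrence_positions(frames, blank="<blank>"):
    """Map each non-blank token to the relative position of its first frame.

    Position is the 0-based frame index divided by the number of frames
    (an assumed convention). A token that appears again later keeps its
    first position, so repeated words are not distinguished here.
    """
    n = len(frames)
    positions = {}
    for i, tok in enumerate(frames):
        if tok != blank and tok not in positions:
            positions[tok] = i / n
    return positions


def entropy(probs):
    """Shannon entropy in nats of one predicted distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical 8-frame CTC output, not taken from the paper's figure.
frames = ["<blank>", "y1", "y1", "<blank>", "<blank>", "y2", "<blank>", "<blank>"]
print(first_occurrence_positions(frames))  # {'y1': 0.125, 'y2': 0.625}

# A uniform distribution over 4 tokens has entropy ln 4; a one-hot one has 0.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # approximately 1.386
```

Under this reading, a forward shift corresponds to smaller first-occurrence values, and reduced prediction uncertainty corresponds to more predictions falling in the low-entropy range.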