mT6: Multilingual Pretrained Text-to-Text Transformer with Translation Pairs

Zewen Chi, Li Dong, Shuming Ma, Shaohan Huang, Xian-Ling Mao, Heyan Huang, Furu Wei

TL;DR

mT6 extends multilingual text-to-text pre-training with translation data by adding three cross-lingual tasks, machine translation (MT), translation pair span corruption (TPSC), and translation span corruption (TSC), together with a partially non-autoregressive (PNAT) objective. Pre-training and fine-tuning both use the same text-to-text format. The model improves over mT5 on XTREME benchmarks and on multilingual generation tasks, and the paper also analyzes cross-lingual representations and word alignments. Combining the translation-based objectives with PNAT reduces the cross-lingual transfer gap and improves token-level alignment, which suggests that translation pairs are a practical way to strengthen multilingual models.

Abstract

Multilingual T5 (mT5) pretrains a sequence-to-sequence model on massive monolingual texts, which has shown promising results on many cross-lingual tasks. In this paper, we improve multilingual text-to-text transfer Transformer with translation pairs (mT6). Specifically, we explore three cross-lingual text-to-text pre-training tasks, namely, machine translation, translation pair span corruption, and translation span corruption. In addition, we propose a partially non-autoregressive objective for text-to-text pre-training. We evaluate the methods on eight multilingual benchmark datasets, including sentence classification, named entity recognition, question answering, and abstractive summarization. Experimental results show that the proposed mT6 improves cross-lingual transferability over mT5.

Paper Structure

This paper contains 35 sections, 7 equations, 4 figures, 17 tables.

Figures (4)

  • Figure 1: Example of the span corruption task used in T5 and mT5.
  • Figure 2: Overview of three cross-lingual text-to-text pre-training tasks. For each task, we provide an example of the input and target text. The words marked with "$\times$" are randomly replaced with unique mask tokens like $\left[\text{M}_1\right]$. Notice that in the translation span corruption task, we mask tokens only in one language.
  • Figure 3: Partially non-autoregressive objective.
  • Figure 4: Evaluation results of different layers on Tatoeba cross-lingual sentence retrieval. We illustrate the average accuracy@1 scores on the Tatoeba test sets of the 14 language pairs covered by the parallel data.
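
The captions of Figures 1 and 2 describe how the span corruption inputs and targets are built: selected spans are replaced with unique mask tokens such as [M1], and the target lists each mask token followed by the text it hides. The following is a minimal Python sketch of that construction, not the authors' implementation. The function names, the whitespace tokenization, the choice of which side to mask in translation span corruption, and the explicit span indices are all assumptions made for illustration; the paper samples spans randomly, and here they are passed in so the result is deterministic.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]  # half-open [start, end) token indices; assumed non-overlapping


def corrupt_spans(tokens: Sequence[str], spans: Sequence[Span]) -> Tuple[List[str], List[str]]:
    """Replace each span with a unique mask token and return (input, target).

    Hypothetical helper: mask tokens are written [M1], [M2], ... as in the
    figure captions; a real tokenizer would use its own sentinel tokens.
    """
    corrupted: List[str] = []
    target: List[str] = []
    cursor = 0
    for offset, (start, end) in enumerate(sorted(spans)):
        mask = f"[M{offset + 1}]"
        corrupted.extend(tokens[cursor:start])  # keep the unmasked text
        corrupted.append(mask)                  # one mask token per span
        target.append(mask)                     # target: mask, then hidden text
        target.extend(tokens[start:end])
        cursor = end
    corrupted.extend(tokens[cursor:])
    return corrupted, target


def translation_pair_span_corruption(
    src: Sequence[str], tgt: Sequence[str],
    src_spans: Sequence[Span], tgt_spans: Sequence[Span],
) -> Tuple[List[str], List[str]]:
    """Concatenate a translation pair and mask spans in both languages."""
    tokens = list(src) + list(tgt)
    # Shift the second sentence's span indices into the concatenated sequence.
    shifted = [(s + len(src), e + len(src)) for s, e in tgt_spans]
    return corrupt_spans(tokens, list(src_spans) + shifted)


def translation_span_corruption(
    src: Sequence[str], tgt: Sequence[str], tgt_spans: Sequence[Span],
) -> Tuple[List[str], List[str]]:
    """Mask spans in only one language (assumed here: the second sentence)."""
    return translation_pair_span_corruption(src, tgt, [], tgt_spans)
```

With the tokens "Thank you for inviting me" and spans (1, 2) and (3, 4), `corrupt_spans` yields the input "Thank [M1] for [M2] me" and the target "[M1] you [M2] inviting". In the translation span corruption variant the other language is left intact, so the model can use the unmasked translation as context when predicting the hidden spans.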