VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation

Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Songfang Huang, Fei Huang, Luo Si

TL;DR

VeCo introduces a novel cross-lingual pre-training paradigm by integrating a plug-in cross-attention module into a Transformer encoder, enabling explicit inter-language alignment through the CA-MLM objective. The model supports on-demand fine-tuning for both cross-lingual understanding and generation, facilitating encoder-only or encoder-decoder initialization. Empirical results show state-of-the-art performance on XTREME across multiple tasks and strong BLEU gains on WMT14 En-De/En-Fr, with ablations confirming the value of bilingual data and CA-MLM. The work provides a practical and scalable framework for unified cross-lingual learning with flexible task-specific fine-tuning strategies.

Abstract

Existing work in multilingual pretraining has demonstrated the potential of cross-lingual transferability by training a unified Transformer encoder for multiple languages. However, much of this work relies only on the shared vocabulary and bilingual contexts to encourage correlation across languages, which is a loose and implicit way of aligning the contextual representations between languages. In this paper, we plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages. This effectively prevents the model from degenerating into predicting masked words conditioned only on the context in their own language. More importantly, when fine-tuning on downstream tasks, the cross-attention module can be plugged in or out on demand, thus naturally benefiting a wider range of cross-lingual tasks, from language understanding to generation. As a result, the proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark, covering text classification, sequence labeling, question answering, and sentence retrieval. For cross-lingual generation tasks, it also outperforms all existing cross-lingual models and state-of-the-art Transformer variants on the WMT14 English-to-German and English-to-French translation datasets, with gains of up to 1 to 2 BLEU.

Paper Structure

This paper contains 30 sections, 5 equations, 3 figures, 17 tables.

Figures (3)

  • Figure 1: The attention scores of XLM and XLM-R given a pair of parallel sentences as input: the English sentence "Take a seat and have a rest" and its Chinese translation. A darker line denotes a higher score. There are only a few attention patterns across English and Chinese subwords.
  • Figure 2: A schematic comparison of cross-lingual pre-training tasks and their attention matrices. When predicting the masked words of different languages: a) MLM can only attend to the context in its own language; b) TLM implicitly attends to a subset of words across languages (as shown in Figure 1). In contrast, c) the proposed CA-MLM can (1) attend to the context in its own language to predict words $\bm x_2$ and $\bm y_3$, and (2) first attend to its own context and then explicitly attend to all words across languages to predict words $\bm x_3$ and $\bm y_2$ via a plug-in cross-attention module.
  • Figure 3: An overview of VeCo. During pre-training, a plug-and-play cross-attention module is jointly pre-trained along with the self-attention module. When fine-tuning on natural language understanding (NLU) tasks, the cross-attention module can be plugged in or out on demand. When fine-tuning on natural language generation (NLG) tasks, VeCo can initialize an encoder-decoder model (the mainstream backbone for generation tasks), since all the necessary modules in the encoder and decoder are already pre-trained.
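
The plug-in and plug-out behavior described in the captions of Figures 2 and 3 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `attend` and `layer` are hypothetical, the attention is single-head with no learned projections, feed-forward sublayers, or layer normalization, and using the raw embeddings of the other language as keys and values is an assumption made for brevity.

```python
import numpy as np

def softmax(scores):
    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def attend(queries, keys_values):
    # Single-head scaled dot-product attention without learned projections
    # (a simplification; a real Transformer projects Q, K and V separately).
    d = queries.shape[-1]
    weights = softmax(queries @ keys_values.T / np.sqrt(d))
    return weights @ keys_values, weights

def layer(x, y=None):
    # Hypothetical layer: self-attention over x, then optional cross-attention
    # over y. Passing y=None corresponds to "plugging out" the module.
    self_out, _ = attend(x, x)
    h = x + self_out                      # residual connection
    if y is None:
        return h, None                    # plug-out: own-language context only
    cross_out, cross_weights = attend(h, y)
    return h + cross_out, cross_weights   # plug-in: also attends across languages

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))   # 4 token vectors in language X (toy data)
y = rng.normal(size=(5, 8))   # 5 token vectors in language Y (toy data)

plug_out, no_weights = layer(x)          # encoder-only style usage
plug_in, cross_weights = layer(x, y)     # cross-attention enabled
print(plug_out.shape, plug_in.shape, cross_weights.shape)
```

In this sketch, each row of `cross_weights` is a distribution over all tokens of the other language, which mirrors the "explicitly attend to all words across languages" step in Figure 2(c). Calling `layer(x)` without `y` leaves only the self-attention path, as in the plug-out setting of Figure 3.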