Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation

Mozhdeh Gheini, Xiang Ren, Jonathan May

TL;DR

This work examines the role of cross-attention in adapting pretrained Transformers for machine translation under transfer learning. By making cross-attention and the new embeddings the only trainable components, the authors show that cross-attention fine-tuning can nearly match full-model fine-tuning across multiple language-pair transfers while reducing the number of parameters that must be stored per language pair. They find that pretrained cross-attention provides translation-specific knowledge and induces alignment between child and parent embeddings, a property that helps mitigate catastrophic forgetting and enables zero-shot translation. The findings suggest a practical, parameter-efficient path for extending MT models to many language pairs and invite further study of module-specific transfer dynamics and cross-lingual representations.

Abstract

We study the power of cross-attention in the Transformer architecture within the context of transfer learning for machine translation, and extend the findings of studies into cross-attention when training from scratch. We conduct a series of experiments through fine-tuning a translation model on data where either the source or target language has changed. These experiments reveal that fine-tuning only the cross-attention parameters is nearly as effective as fine-tuning all parameters (i.e., the entire translation model). We provide insights into why this is the case and observe that limiting fine-tuning in this manner yields cross-lingually aligned embeddings. The implications of this finding for researchers and practitioners include a mitigation of catastrophic forgetting, the potential for zero-shot translation, and the ability to extend machine translation models to several new language pairs with reduced parameter storage overhead.
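
The restricted fine-tuning described above can be sketched as a parameter filter: freeze everything, then re-enable only the new source embeddings and the decoder's cross-attention. This is a minimal illustration, not the authors' code. The parameter names (`encoder.embed_tokens`, `encoder_attn`) are assumed to follow fairseq-style naming, and excluding the cross-attention layer norm is likewise an assumption; both would need adjusting for another codebase.

```python
from typing import Callable, List

# Assumed fairseq-style name components; adjust for your own model.
CROSS_ATTN_KEY = "encoder_attn"            # decoder attention over encoder outputs
SRC_EMBED_PREFIX = "encoder.embed_tokens"  # embeddings for the new source language


def is_trainable(name: str) -> bool:
    """Return True for new-source-embedding and cross-attention parameters."""
    if name.startswith(SRC_EMBED_PREFIX):
        return True
    # Match a whole dotted component so "encoder_attn_layer_norm" is excluded
    # (an assumption of this sketch, not a detail confirmed by the summary).
    return name.startswith("decoder.") and CROSS_ATTN_KEY in name.split(".")


def freeze_all_but(model, keep: Callable[[str], bool] = is_trainable) -> List[str]:
    """Set requires_grad on every parameter; return the names left trainable.

    `model` only needs a `named_parameters()` method yielding (name, param)
    pairs whose params expose `requires_grad`, as torch.nn.Module does.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = keep(name)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

Only the parameters returned by `freeze_all_but` would need to be stored for each new language pair; the frozen body can be shared with the parent model. For a target-side transfer, the embedding prefix would point at the decoder embeddings instead.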

Paper Structure

This paper contains 32 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of our transfer learning experiments, depicting (a) training from scratch, (b) conventional fine-tuning (src+body), (c) fine-tuning cross-attention (src+xattn), (d) fine-tuning new vocabulary (src), (e) fine-tuning cross-attention when transferring target language (tgt+xattn), (f) transfer learning with updating cross-attention from scratch (src+randxattn). Dotted components are initialized randomly, while solid lines are initialized with parameters from a pretrained model. Shaded, underlined components are fine-tuned, while other components are frozen.
  • Figure 2: BLEU scores across different transfer settings using mBART as parent. Exclusive fine-tuning of embeddings (embed) is not effective at all due to lack of translation knowledge in the cross-attention layers.
  • Figure 3: Accuracy of bilingual dictionaries induced through embeddings learned under tgt+body and tgt+xattn settings. De and Es effectively get aligned with En under tgt+xattn (left). As they are both aligned to En, we can also indirectly obtain a De–Es dictionary (right). The same approach fails completely under tgt+body.
  • Figure 4: Performance on the original language pair after transfer. The original Fr–En parent model scores 35.0 BLEU on the Fr–En test set. {src,tgt}+xattn outperforms {src,tgt}+body on the parent task.
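
The dictionary induction evaluated in Figure 3 relies on embeddings from different languages sharing one space. A common way to read a dictionary out of such embeddings is nearest-neighbor retrieval by cosine similarity, sketched below. This summary does not state the paper's exact retrieval procedure, so the method here is an assumption, and the function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def induce_dictionary(
    src_emb: Dict[str, Vector], tgt_emb: Dict[str, Vector]
) -> Dict[str, str]:
    """Map each source word to the target word with the most similar embedding."""
    return {
        word: max(tgt_emb, key=lambda t: cosine(vec, tgt_emb[t]))
        for word, vec in src_emb.items()
    }
```

If two child languages are each aligned to the same parent language, the same function can be applied directly between the two child embedding tables, which mirrors the indirect De–Es dictionary described in the caption.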