Improving Language Transfer Capability of Decoder-only Architecture in Multilingual Neural Machine Translation

Zhi Qu, Yiran Wang, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe

TL;DR

This work identifies the lack of language transfer capability as the main weakness of decoder-only MNMT. It introduces the Two-stage Decoder-only (TDO) architecture, which separates source-to-target transfer in the first stage from target token generation in the second stage, and adds Instruction-level Contrastive Learning (InstruCL) to supervise source representations toward the target language. On TED-19 and OPUS-100, TDO performs competitively in supervised translation while InstruCL substantially improves zero-shot translation, with reported gains of up to 3.39 BLEU and 4.81 COMET over the encoder-decoder architecture. Representational analyses indicate that the improvements stem from enhanced language transfer in the decoder-only framework, suggesting a practical route to scalable, efficient multilingual translation without full encoder-decoder architectures.

Abstract

Existing multilingual neural machine translation (MNMT) approaches mainly focus on improving models with the encoder-decoder architecture to translate multiple languages. However, the decoder-only architecture has been explored less in MNMT because it underperforms when trained solely on parallel data. In this work, we attribute this issue to the decoder-only architecture's lack of language transfer capability. Specifically, the decoder-only architecture does not sufficiently encode source tokens with target language features. We propose dividing the decoding process into two stages so that target tokens are explicitly excluded in the first stage to implicitly boost the transfer capability across languages. Additionally, we impose contrastive learning on translation instructions, resulting in improved performance in zero-shot translation. We conduct experiments on the TED-19 and OPUS-100 datasets, considering both training from scratch and fine-tuning scenarios. Experimental results show that, compared to the encoder-decoder architecture, our methods not only perform competitively in supervised translations but also achieve improvements of up to 3.39 BLEU, 6.99 chrF++, 3.22 BERTScore, and 4.81 COMET in zero-shot translations.
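
The two-stage idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the `first_stage_layers` split point, and the use of a prefix-style attention mask in the second stage are assumptions made for illustration. The sketch only shows which positions each layer would process if target tokens are withheld from the first stage, and which positions may attend to which once source and target are processed together.

```python
def prefix_mask(src_len, tgt_len):
    """mask[i][j] is True when position i may attend to position j."""
    total = src_len + tgt_len
    mask = []
    for i in range(total):
        if i < src_len:
            # Source positions see the whole source and no target tokens.
            row = [j < src_len for j in range(total)]
        else:
            # Target positions see the source and target tokens up to themselves.
            row = [j <= i for j in range(total)]
        mask.append(row)
    return mask


def tokens_per_layer(num_layers, first_stage_layers, src_len, tgt_len):
    """Number of positions each layer processes under the assumed two-stage split."""
    counts = []
    for layer in range(num_layers):
        if layer < first_stage_layers:
            # Assumed stage 1: only source-side positions are present.
            counts.append(src_len)
        else:
            # Assumed stage 2: source states are joined by target tokens.
            counts.append(src_len + tgt_len)
    return counts


if __name__ == "__main__":
    # Print the assumed stage-2 mask for 2 source and 2 target positions.
    for row in prefix_mask(2, 2):
        print("".join("x" if allowed else "." for allowed in row))
    # Print how many positions each of 6 layers would process.
    print(tokens_per_layer(num_layers=6, first_stage_layers=2, src_len=5, tgt_len=4))
```

Under these assumptions, the layers of the first stage never see target tokens, so whatever target language information the source representations carry must come from the source side of the input.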

Paper Structure

This paper contains 36 sections, 9 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Comparison between different architectures in preliminary experiments on TED-19. Figure \ref{fig:comparison_performance} shows the BLEU scores. Figure \ref{fig:preference} shows the layer-wise language feature representations of a sentence, where the x-axis indicates the layer number and the vertical line indicates the value range. Specifically, we follow transfer to compute a similarity score: values higher than 0.5 mean the representation exhibits more target language features, and values lower than 0.5 mean it exhibits more source language features. Appendix \ref{appendix:preference} provides the implementation details.
  • Figure 2: Illustration of the encoder-decoder architecture and the decoder-only architecture.
  • Figure 3: Illustration of the proposed methods. Notably, the term Token refers not only to the actual token before and after processing by the model, but also to the representation at the corresponding position. (a) shows the Two-stage Decoder-only architecture together with the Adaption, i.e., an additional FFN that narrows the gap between source and target representations through a non-linear transformation. (b) shows the Instruction-level Contrastive Learning. Underlining marks target tokens, and [*] denotes the instruction of an instance. For the anchor, each negative instance in this figure has at least one of two features: 1) a different target language and 2) non-parallel semantics.
  • Figure 4: Illustration of linguistic preference, following Figure \ref{fig:preference}. All cases in this figure use the Prefix manner for the masked self-attention mechanism. Square markers denote the prefix decoder-only architecture, and round markers denote our proposed methods. The x-axis is the layer index, and the vertical line indicates the value range.
  • Figure 5: Performance variation across different values of $M$. The y-axis is the variation ratio relative to the performance of the model with the prefix decoder-only architecture, and the x-axis is the value of $M$. The values of $N$ are 6 and 12 for TED-19 and OPUS-100, respectively. The solid and dotted lines indicate supervised and zero-shot translations, respectively.
  • ...and 4 more figures
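
The Instruction-level Contrastive Learning described in the Figure 3 caption contrasts an anchor against negative instances that differ in target language or are not parallel in meaning. The sketch below is a generic InfoNCE-style loss over small vectors, not the paper's exact objective: the cosine similarity, the `temperature` value, and the choice of what serves as the positive instance are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: low when the anchor is closest to the positive."""
    # Index 0 holds the positive pair; the rest are negative pairs.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    # Hypothetical 2-d "instruction representations" for illustration only.
    anchor = [1.0, 0.0]
    same_target_parallel = [0.9, 0.1]   # assumed positive instance
    other_target_language = [0.0, 1.0]  # assumed negative instance
    print(contrastive_loss(anchor, same_target_parallel, [other_target_language]))
```

Minimizing such a loss pulls the anchor toward the positive and pushes it away from the negatives, which matches the caption's description of negatives at a high level.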