Enhancing Document-level Translation of Large Language Model via Translation Mixed-instructions

Yachao Li, Junhui Li, Jing Jiang, Min Zhang

TL;DR

This paper tackles the challenge of document-level translation with lightweight LLMs trained on sentence-level instructions, which perform poorly on inputs longer than 512 tokens. It proposes translation mixed-instructions, combining sentence-level and document-level prompts of varying lengths, and fine-tunes Llama-2 models (7B and 13B) with LoRA on News Commentary data across 10 language pairs. The results show that mixed-instructions substantially improve document-level translation quality (d-BLEU, COMET) and discourse coherence while maintaining sentence-level performance; 512-length document segmentation and a 512-token interval help balance data size and performance, achieving robust results up to 2048 tokens. Overall, the approach demonstrates that instruction design targeting document-level mapping enables strong document-level MT in smaller LLMs, offering practical gains for cross-language translation and discourse-aware translation tasks.

Abstract

Existing large language models (LLMs) for machine translation are typically fine-tuned on sentence-level translation instructions and achieve satisfactory performance at the sentence level. However, when applied to document-level translation, these models face a significant challenge, particularly when dealing with documents containing over 512 tokens. This challenge arises from the issue of sentence-level coverage, where subsequent sentences in the document remain untranslated. As a result, the document-level translation capability of LLMs fine-tuned on sentence-level translation instructions is significantly limited. We conjecture that the primary cause of LLMs' weak document-level translation performance is the absence of document-to-document mapping ability. To address the issue, we propose an approach that combines sentence-level and document-level translation instructions of varying lengths to fine-tune LLMs. Our proposed translation mixed-instructions enable LLMs (Llama-2 7B and 13B) to maintain consistent translation performance from the sentence level to documents containing as many as 2048 tokens. Extensive experimental results show that the proposed approach significantly enhances the document-level translation capabilities of LLMs on 10 language pairs, effectively mitigating the sentence-level coverage issue in document-level translation. Experimentation on discourse phenomena has demonstrated that our document-level translation approach significantly improves translation quality, both in terms of BLEU score and discourse coherence.
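
The idea of mixing sentence-level and document-level instructions of varying lengths can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the prompt wording, the Alpaca-style `instruction`/`input`/`output` fields, and the whitespace token count (standing in for the model's subword tokenizer) are all assumptions made for illustration. The default length buckets (512, 1024, 1536, 2048) are one reading of the 512-token interval and the 2048-token upper bound mentioned above.

```python
import json


def chunk_document(src_sents, tgt_sents, max_tokens):
    """Greedily group aligned sentence pairs into chunks whose source side
    stays within max_tokens. Whitespace tokens are an assumed stand-in
    for the model's subword tokens."""
    chunks, cur_src, cur_tgt, cur_len = [], [], [], 0
    for src, tgt in zip(src_sents, tgt_sents):
        n = len(src.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if cur_src and cur_len + n > max_tokens:
            chunks.append((cur_src, cur_tgt))
            cur_src, cur_tgt, cur_len = [], [], 0
        cur_src.append(src)
        cur_tgt.append(tgt)
        cur_len += n
    if cur_src:
        chunks.append((cur_src, cur_tgt))
    return chunks


def make_record(src_text, tgt_text, src_lang, tgt_lang):
    """Build one instruction record (hypothetical field names and prompt)."""
    return {
        "instruction": f"Translate the following text from {src_lang} to {tgt_lang}.",
        "input": src_text,
        "output": tgt_text,
    }


def build_mixed_instructions(src_sents, tgt_sents, src_lang, tgt_lang,
                             lengths=(512, 1024, 1536, 2048)):
    """Combine sentence-level records with document-level records built
    at several maximum lengths, dropping exact duplicates."""
    records, seen = [], set()

    def add(src_text, tgt_text):
        key = (src_text, tgt_text)
        if key not in seen:
            seen.add(key)
            records.append(make_record(src_text, tgt_text, src_lang, tgt_lang))

    # Sentence-level instructions: one record per aligned sentence pair.
    for src, tgt in zip(src_sents, tgt_sents):
        add(src, tgt)
    # Document-level instructions: one record per chunk, per length bucket.
    for max_tokens in lengths:
        for chunk_src, chunk_tgt in chunk_document(src_sents, tgt_sents, max_tokens):
            add(" ".join(chunk_src), " ".join(chunk_tgt))
    return records
```

Calling `json.dumps(build_mixed_instructions(src, tgt, "English", "German"), ensure_ascii=False, indent=2)` would serialize the records for fine-tuning. Joining sentences with a space suits space-delimited languages; for Chinese, a different joiner would be needed.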

Paper Structure

This paper contains 19 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Translation performance of Llama-2 with different input lengths. "SENT" denotes the sentence-level input. For a fair comparison, we first obtain the document-level translation of original documents and then compute document-level BLEU (d-BLEU).
  • Figure 2: Example of sentence-level translation instruction in JSON format.
  • Figure 3: Example of document-level translation instruction with a document of 6 sentences.
  • Figure 4: Document-level (512-length tokens) translation performance of Llama-2 fine-tuned on sentence-level instructions.
  • Figure 5: The Zh-En translation.
  • ...and 2 more figures