Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning

Tianxiang Hu; Pei Zhang; Baosong Yang; Jun Xie; Derek F. Wong; Rui Wang

TL;DR

A domain Chain of Thought (CoT) fine-tuning technique that uses the intrinsic multi-domain intelligence of LLMs to improve translation, achieving notably better translation accuracy and domain robustness than traditional fine-tuning.

Abstract

Achieving consistently high-quality machine translation (MT) across diverse domains remains a significant challenge, primarily because the parallel training data available in many domains is limited and imbalanced. While large language models (LLMs) have demonstrated impressive general understanding and generation abilities, their potential in multi-domain MT is under-explored. We establish a comprehensive benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets, covering 15 domains. Our evaluation of prominent LLMs reveals a discernible performance gap against traditional MT systems, and highlights domain overfitting and catastrophic forgetting after fine-tuning on domain-limited corpora. To mitigate this, we propose a domain Chain of Thought (CoT) fine-tuning technique that uses the intrinsic multi-domain intelligence of LLMs to improve translation performance. The method prompts the LLM to perceive domain information from the source text, which then serves as a hint to guide the translation process. Despite being trained on a small dataset of four domains, our CoT fine-tuning approach achieves notably better translation accuracy and domain robustness than traditional fine-tuning, as evidenced by an average gain of 1.53 BLEU across more than 20 distinct German$\rightarrow$English out-of-domain test sets.
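The two-step inference described in the abstract (first perceive the domain, then translate with that hint) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt templates, the `cot_translate` function, and the `generate` callable (a stand-in for any LLM text-completion call) are all assumptions introduced here.

```python
from typing import Callable, Tuple

# Hypothetical prompt templates; the paper's actual prompt wording is not
# given in this summary, so these are placeholders for illustration only.
HINT_TEMPLATE = (
    "Describe the domain of the following {src_lang} text in one sentence.\n"
    "Text: {source}\n"
    "Domain hint:"
)
TRANSLATE_TEMPLATE = (
    "Domain hint: {hint}\n"
    "Translate the following {src_lang} text into {tgt_lang}.\n"
    "{src_lang}: {source}\n"
    "{tgt_lang}:"
)


def cot_translate(
    source: str,
    generate: Callable[[str], str],
    src_lang: str = "German",
    tgt_lang: str = "English",
) -> Tuple[str, str]:
    """Two-step inference: generate a domain hint, then translate with it.

    `generate` is an assumed interface: it takes a prompt string and
    returns the model's completion as a string.
    """
    # Step 1: ask the model to infer domain information from the source text.
    hint_prompt = HINT_TEMPLATE.format(src_lang=src_lang, source=source)
    hint = generate(hint_prompt).strip()

    # Step 2: translate, conditioning on the hint produced in step 1.
    translate_prompt = TRANSLATE_TEMPLATE.format(
        hint=hint, src_lang=src_lang, tgt_lang=tgt_lang, source=source
    )
    translation = generate(translate_prompt).strip()
    return hint, translation
```

Passing the model as a plain callable keeps the sketch independent of any particular inference library; in practice `generate` would wrap whichever fine-tuned model is being evaluated.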

Paper Structure (25 sections, 3 equations, 4 figures, 6 tables)

Introduction
Related Work
Multi-Domain Machine Translation
LLMs for Translation
Benchmarking LLMs for Multi-domain Machine Translation
Settings
Results
Analyzing LLM’s Translation Performance with Fine-Tuning
Settings
Results
Overfitting Problem in Fine-Tuning LLM
Method
Chain-of-Thought Fine-tuning
Experiments
Settings
...and 10 more sections

Figures (4)

Figure 1: Performance comparison of prominent LLMs on multi-domain German-to-English translation (other language directions are in the appendix). For clear comparison, we show the scores normalized by the maximum score in each domain. The performance of LLMs varies greatly across domains. Best viewed in color.
Figure 2: Comparison of our proposed CoT fine-tuning and the traditional fine-tuning framework in the training and inference processes. Training includes a domain hint generation task and a domain translation task; at inference, the model first generates domain-aware hints and then performs domain translation based on the generated hints.
Figure 3: Model performance as training epochs progress. For De-En, fine-tuning LLaMA-2-7b shows an overfitting problem on out-of-domain (OOD) data. CoT-FT alleviates the problem to a certain extent.
Figure 4: BLEU and COMET scores for FT and CoT-FT at different data sizes (De-En). The proposed CoT fine-tuning outperforms direct fine-tuning in both in-domain and OOD scenarios.
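
The per-domain normalization mentioned in the Figure 1 caption (each score divided by the maximum score in its domain) can be illustrated with a short sketch. The function name and the nested-dictionary layout are assumptions made for this example, and the numbers in the usage note are made up, not results from the paper.

```python
from typing import Dict

Scores = Dict[str, Dict[str, float]]  # assumed layout: {model: {domain: score}}


def normalize_by_domain_max(scores: Scores) -> Scores:
    """Divide every score by the best score any model achieved in that domain."""
    # Collect the maximum score per domain across all models.
    domain_max: Dict[str, float] = {}
    for per_model in scores.values():
        for domain, score in per_model.items():
            domain_max[domain] = max(domain_max.get(domain, score), score)

    # Rescale so the best model in each domain gets 1.0.
    return {
        model: {
            domain: (score / domain_max[domain] if domain_max[domain] else 0.0)
            for domain, score in per_model.items()
        }
        for model, per_model in scores.items()
    }
```

For instance, with illustrative inputs `{"model_a": {"medical": 40.0, "law": 30.0}, "model_b": {"medical": 20.0, "law": 60.0}}`, `model_a` maps to 1.0 on `medical` and 0.5 on `law`, while `model_b` maps to 0.5 and 1.0, which puts domains with very different absolute score ranges on a common scale.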
