TIM: Teaching Large Language Models to Translate with Comparison

Jiali Zeng, Fandong Meng, Yongjing Yin, Jie Zhou

TL;DR

TIM introduces comparison-based instruction tuning to improve translation in open-source LLMs by combining output comparison and a novel preference loss. It constructs order-, dictionary-, and error-guided data, and leverages three tuning strategies (LoRA, FixEmb, Full) to balance efficiency and capacity. Empirical results on WMT22 and FLORES-200 show TIM consistently boosts BLEU and COMET scores, enhances zero-shot translation, and yields strong quality-estimation performance without references, especially for smaller models. These findings suggest TIM as a practical framework for closing the translation gap in smaller, openly available LLMs and for joint translation-evaluation tasks.

Abstract

Open-source large language models (LLMs) have demonstrated remarkable efficacy in various tasks with instruction tuning. However, these models can sometimes struggle with tasks that require more specialized knowledge, such as translation. One possible reason for this deficiency is that instruction tuning aims to generate fluent and coherent text that continues from a given instruction without being constrained by any task-specific requirements. Moreover, tuning smaller LLMs with lower-quality training data can be more challenging. To address this issue, we propose a novel framework that uses comparative examples to teach LLMs translation. Our approach involves presenting the model with examples of correct and incorrect translations and using a preference loss to guide the model's learning. We evaluate our method on the WMT2022 test sets and show that it outperforms existing methods. Our findings offer a new perspective on fine-tuning LLMs for translation tasks and provide a promising solution for generating high-quality translations. Please refer to GitHub for more details: https://github.com/lemon0830/TIM.
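
The idea of pairing a language modeling objective with a preference loss over a correct and an incorrect translation can be illustrated with a small sketch. This is not the authors' implementation: it assumes a generic pairwise (Bradley-Terry style) ranking loss, and the function names, the scalar scores, and the `pref_weight` parameter are hypothetical names introduced only for illustration.

```python
import math

def lm_loss(token_log_probs):
    """Average negative log-likelihood of the preferred translation's tokens."""
    return -sum(token_log_probs) / len(token_log_probs)

def preference_loss(score_good, score_bad):
    """Pairwise ranking loss: -log(sigmoid(score_good - score_bad)).

    The scores are assumed to be scalar quality scores that a model assigns
    to the correct and the incorrect translation (hypothetical setup).
    """
    margin = score_good - score_bad
    # softplus(-margin), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def combined_loss(token_log_probs, score_good, score_bad, pref_weight=1.0):
    """Language modeling loss plus a weighted preference term (assumed form)."""
    return lm_loss(token_log_probs) + pref_weight * preference_loss(score_good, score_bad)

# Usage: the loss shrinks as the correct translation is scored above the bad one.
log_probs = [math.log(0.5), math.log(0.5)]
loss_ranked_right = combined_loss(log_probs, score_good=2.0, score_bad=0.0)
loss_ranked_wrong = combined_loss(log_probs, score_good=0.0, score_bad=2.0)
```

In this sketch the preference term only depends on the score margin, so a model is penalized whenever it fails to rank the correct translation above the noisy one, while the language modeling term keeps it generating the reference output.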

Paper Structure

This paper contains 28 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustration of three types of output comparison. Text in blue marks the added notes and the differences in the output that result from them.
  • Figure 2: Overall framework of our proposed TIM. Given the contrastive outputs of each instance, we optimize the LLMs with the general language modeling loss and the token-level preference loss.
  • Figure 3: An example of contrastive outputs for preference comparison. "Bad Output" denotes the noisy translation that is compared with the "Output".
  • Figure 4: Effect of instructions. We fine-tune BLOOMZ-7b-mt with our TIM and report BLEU scores of 10 different instructions on four language pairs.
  • Figure 5: Effect of model sizes. We present a comparison between TIM and instruction tuning across LLMs with different model sizes including BLOOM-1b7, BLOOM-3b, BLOOMZ-7b-mt, LLaMA-2-7b, and LLaMA-2-13b.
  • ...and 1 more figure