Table of Contents
Fetching ...

Towards Making the Most of ChatGPT for Machine Translation

Keqin Peng, Liang Ding, Qihuang Zhong, Li Shen, Xuebo Liu, Min Zhang, Yuanxin Ouyang, Dacheng Tao

TL;DR

This study investigates how to maximize ChatGPT's machine translation capabilities by tuning temperature and shaping prompts. It introduces Task-Specific Prompts (TSP) and Domain-Specific Prompts (DSP) and systematically evaluates zero-shot and few-shot MT across Flores-200 and cross-domain datasets using COMET as the primary metric. Key findings show that low temperature improves translation quality, task- and domain-informed prompts boost performance (with DSP sometimes surpassing a major translator), but non-English-centric tasks can yield hallucinations and chain-of-thought prompts can degrade results. The work highlights actionable prompt designs for MT with LLMs and discusses implications for future research in prompt engineering and MT evaluation.

Abstract

ChatGPT shows remarkable capabilities for machine translation (MT). Several prior studies have shown that it achieves comparable results to commercial systems for high-resource languages, but lags behind in complex tasks, e.g., low-resource and distant-language-pairs translation. However, they usually adopt simple prompts which can not fully elicit the capability of ChatGPT. In this paper, we aim to further mine ChatGPT's translation ability by revisiting several aspects: temperature, task information, and domain information, and correspondingly propose an optimal temperature setting and two (simple but effective) prompts: Task-Specific Prompts (TSP) and Domain-Specific Prompts (DSP). We show that: 1) The performance of ChatGPT depends largely on temperature, and a lower temperature usually can achieve better performance; 2) Emphasizing the task information can further improve ChatGPT's performance, particularly in complex MT tasks; 3) Introducing domain information can elicit ChatGPT's generalization ability and improve its performance in the specific domain; 4) ChatGPT tends to generate hallucinations for non-English-centric MT tasks, which can be partially addressed by our proposed prompts but still need to be highlighted for the MT/NLP community. We also explore the effects of advanced in-context learning strategies and find a (negative but interesting) observation: the powerful chain-of-thought prompt leads to word-by-word translation behavior, thus bringing significant translation degradation.

Towards Making the Most of ChatGPT for Machine Translation

TL;DR

This study investigates how to maximize ChatGPT's machine translation capabilities by tuning temperature and shaping prompts. It introduces Task-Specific Prompts (TSP) and Domain-Specific Prompts (DSP) and systematically evaluates zero-shot and few-shot MT across Flores-200 and cross-domain datasets using COMET as the primary metric. Key findings show that low temperature improves translation quality, task- and domain-informed prompts boost performance (with DSP sometimes surpassing a major translator), but non-English-centric tasks can yield hallucinations and chain-of-thought prompts can degrade results. The work highlights actionable prompt designs for MT with LLMs and discusses implications for future research in prompt engineering and MT evaluation.

Abstract

ChatGPT shows remarkable capabilities for machine translation (MT). Several prior studies have shown that it achieves comparable results to commercial systems for high-resource languages, but lags behind in complex tasks, e.g., low-resource and distant-language-pairs translation. However, they usually adopt simple prompts which can not fully elicit the capability of ChatGPT. In this paper, we aim to further mine ChatGPT's translation ability by revisiting several aspects: temperature, task information, and domain information, and correspondingly propose an optimal temperature setting and two (simple but effective) prompts: Task-Specific Prompts (TSP) and Domain-Specific Prompts (DSP). We show that: 1) The performance of ChatGPT depends largely on temperature, and a lower temperature usually can achieve better performance; 2) Emphasizing the task information can further improve ChatGPT's performance, particularly in complex MT tasks; 3) Introducing domain information can elicit ChatGPT's generalization ability and improve its performance in the specific domain; 4) ChatGPT tends to generate hallucinations for non-English-centric MT tasks, which can be partially addressed by our proposed prompts but still need to be highlighted for the MT/NLP community. We also explore the effects of advanced in-context learning strategies and find a (negative but interesting) observation: the powerful chain-of-thought prompt leads to word-by-word translation behavior, thus bringing significant translation degradation.
Paper Structure (27 sections, 3 figures, 10 tables)

This paper contains 27 sections, 3 figures, 10 tables.

Figures (3)

  • Figure 1: The relationship between temperature and ChatGPT's performance (in terms of COMET scores) when translating from English to other languages.
  • Figure 2: The relationship between temperature and ChatGPT's performance (in terms of BLEU scores) when translating from English to other languages.
  • Figure 3: Number of Post-Edited sentences in non-English-centric language pairs, where a higher value means the translation contains more hallucinations. RO represents the translation for ZH$\Rightarrow$RO, while ZH represents the translation for ZH$\Rightarrow$RO.