Evaluating o1-Like LLMs: Unlocking Reasoning for Translation through Comprehensive Analysis

Andong Chen, Yuchen Song, Wenxin Zhu, Kehai Chen, Muyun Yang, Tiejun Zhao, Min Zhang

TL;DR

This study systematically evaluates o1-Like LLMs for multilingual machine translation on the Flores-200, Commonsense MT, Culture MT, and RTT benchmarks, comparing them with ChatGPT, GPT-4o, and DeepSeek-V3. It finds that o1-Like LLMs achieve strong multilingual and cultural translation capabilities, with DeepSeek-R1 notably surpassing GPT-4o in contextless tasks, yet they tend to ramble in Chinese-centric outputs and incur substantially higher inference costs. The analysis shows that larger models generally translate better, that temperature strongly affects results, and that reasoning-driven performance trades off against efficiency. The findings inform the deployment and optimization of reasoning-based MT systems, underscoring the need for efficiency improvements, better instruction adherence, and external modules to mitigate hallucinations in complex translation scenarios.

Abstract

The o1-Like LLMs are transforming AI by simulating human cognitive processes, but their performance in multilingual machine translation (MMT) remains underexplored. This study examines: (1) how o1-Like LLMs perform in MMT tasks and (2) what factors influence their translation quality. We evaluate multiple o1-Like LLMs and compare them with traditional models like ChatGPT and GPT-4o. Results show that o1-Like LLMs establish new multilingual translation benchmarks, with DeepSeek-R1 surpassing GPT-4o in contextless tasks. They demonstrate strengths in historical and cultural translation but tend to ramble in Chinese-centric outputs. Further analysis reveals three key insights: (1) High inference costs and slower processing speeds make complex translation tasks more resource-intensive. (2) Translation quality improves with model size, enhancing commonsense reasoning and cultural translation. (3) The temperature parameter significantly impacts output quality: lower temperatures yield more stable and accurate translations, while higher temperatures reduce coherence and precision.
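
To make the temperature insight concrete, here is a minimal sketch of standard softmax temperature scaling, which is the usual way a decoding temperature reshapes a next-token distribution. The abstract does not describe how the evaluated models apply temperature internally, so the function name `softmax_with_temperature` and the example logits are assumptions chosen purely for illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, dividing by the temperature first.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate tokens (assumption, not real model output).
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # sharply peaked on the top token
high = softmax_with_temperature(logits, 2.0)  # flatter, so sampling varies more

print("T=0.2:", [round(p, 3) for p in low])
print("T=2.0:", [round(p, 3) for p in high])
```

At a low temperature almost all probability mass sits on the highest-scoring token, which is consistent with the more stable outputs reported; at a high temperature the mass spreads across alternatives, so sampled translations vary more.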

Paper Structure

This paper contains 20 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: The main results of the instruction-following analysis are presented.
  • Figure 2: An example of rambling issues is illustrated. The so-called "final translation" deviates from the exact translation, instead providing an explanation of the source text.
  • Figure 3: The main results of the multi-scale model analysis are presented.
  • Figure 4: The main results of the temperature analysis are presented.
  • Figure 5: An example of rambling issues is illustrated. The so-called "final translation" deviates from the exact translation, instead providing an explanation of the source text.
  • ...and 1 more figure