The Paradox of Poetic Intent in Back-Translation: Evaluating the Quality of Large Language Models in Chinese Translation

Li Weigang, Pedro Carvalho Brom

TL;DR

This paper investigates Chinese→English→Chinese back-translation quality using a diverse corpus that covers terminological, historical, and poetic texts. It introduces the BT-Fried framework, which combines multiple automatic metrics with a non-parametric statistical test (the Friedman test) to compare six LLMs and three traditional translation tools across 89 chemistry abstracts and poetic-literary content. A key finding is the Paradox of Poetic Intent: LLMs often prioritize surface-level literal fidelity at the cost of deeper cultural meaning and imagery, and phenomena such as verbatim back-translation are interpreted as suggesting quasi-self-awareness. The work also proposes a Jieba-based BLEU variant and reports that non-reasoning models can outperform reasoning-enabled ones on semantic preservation, with practical implications for Chinese NLP (CNLP) evaluation and model fine-tuning. Overall, the study provides empirical benchmarks and a conceptual lens for advancing culturally aware Chinese translation and cross-lingual language modeling.

Abstract

The rapid advancement of large language models (LLMs) has reshaped the landscape of machine translation, yet challenges persist in preserving poetic intent and cultural heritage and in handling specialized terminology in Chinese-English translation. This study constructs a diverse corpus encompassing Chinese scientific terminology, historical translation paradoxes, and literary metaphors. Using a back-translation and Friedman test-based evaluation system (BT-Fried), we evaluate BLEU, chrF, TER, and semantic similarity metrics across six major LLMs (e.g., GPT-4.5, DeepSeek V3) and three traditional translation tools. Key findings include: (1) Scientific abstracts often benefit from back-translation, while traditional tools outperform LLMs in linguistically distinct texts; (2) LLMs struggle with cultural and literary retention, exemplifying the "paradox of poetic intent"; (3) Some models exhibit "verbatim back-translation", reflecting emergent memory behavior; (4) A novel BLEU variant using Jieba segmentation and n-gram weighting is proposed. The study contributes to the empirical evaluation of Chinese NLP performance and advances understanding of cultural fidelity in AI-mediated translation.
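The statistical core of a Friedman-based comparison such as the one the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `average_ranks` and `friedman_statistic`, the system labels, and the scores are all hypothetical, and the statistic is computed without the tie correction that library implementations typically apply. Each source text acts as a block, the systems are ranked within each block on one metric, and the rank sums feed the chi-square statistic.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values ascending (1 = smallest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for i in order[pos:end + 1]:
            ranks[i] = mean_rank
        pos = end + 1
    return ranks


def friedman_statistic(scores: Dict[str, List[float]]) -> float:
    """Friedman chi-square statistic (no tie correction).

    `scores` maps each system name to its per-text scores on a single metric;
    all lists must have the same length and the same text order.
    """
    systems = list(scores)
    k = len(systems)
    if k < 2:
        raise ValueError("need at least two systems")
    n = len(scores[systems[0]])
    if n == 0 or any(len(scores[s]) != n for s in systems):
        raise ValueError("every system needs the same non-zero number of scores")

    rank_sums = [0.0] * k
    for t in range(n):
        row = [scores[s][t] for s in systems]  # one block: all systems on text t
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank

    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Made-up semantic-similarity scores for three hypothetical systems on four texts.
example_scores = {
    "system_a": [0.90, 0.80, 0.70, 0.95],
    "system_b": [0.50, 0.60, 0.65, 0.70],
    "system_c": [0.10, 0.20, 0.30, 0.40],
}
statistic = friedman_statistic(example_scores)
print(statistic)
```

With k systems, the statistic is compared against a chi-square distribution with k − 1 degrees of freedom. In practice, `scipy.stats.friedmanchisquare` returns both the tie-corrected statistic and the p-value, so a hand-rolled version like this one is mainly useful for seeing what the test measures.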

Paper Structure

This paper contains 37 sections, 1 equation, 3 figures, and 10 tables.

Figures (3)

  • Figure 1: Conceptual diagram of the Paradox of Poetic Intent in back-translation versus emergent quasi-self-awareness in LLM verbatim back-translation, where $ZHx$ is the original Chinese text, $EN$ is the translated English text, and $ZHy$ is the back-translated Chinese text.
  • Figure 2: Comparison of translation metrics across models.
  • Figure 3: Pairwise scatter plot matrix with Spearman’s correlations and Benjamini-Hochberg correction.
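The Benjamini-Hochberg correction named in the Figure 3 caption adjusts a family of p-values, here presumably one per pairwise metric correlation, to control the false discovery rate. The sketch below shows the standard step-up adjustment; the function name `benjamini_hochberg` and the p-values are hypothetical and are not taken from the paper.

```python
from typing import List


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, ascending by p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up p-values for four hypothetical pairwise correlations.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)
print(adjusted_p)
```

A correlation is then reported as significant when its adjusted p-value falls below the chosen false discovery rate, for example 0.05.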