How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs
Ran Zhang, Wei Zhao, Steffen Eger
TL;DR
This work tackles the challenge of evaluating literary translation in the era of large language models by introducing LitEval-Corpus, a large, paragraph-level parallel corpus with verified human translations and outputs from nine MT systems across four language pairs. It systematically compares human evaluation schemes (MQM, SQM, BWS) and automatic, LLM-based metrics, revealing that MQM is often inadequate for literary translation, SQM's effectiveness depends on evaluator expertise, and BWS is best for distinguishing high-quality human translations from top systems. Automatic metrics, including GEMBA-MQM, correlate with human judgments only moderately and still fail to reliably separate human translations from LLM outputs, with LLMs tending to produce more literal and less diverse translations. The findings underscore that published human translations still outperform recent LLM translations and highlight the need for improved metrics that capture literary style and terminology, as well as further exploration of prompting and data contamination effects in literary MT evaluation.
Abstract
Recent research has focused on literary machine translation (MT) as a new challenge in MT. However, the evaluation of literary MT remains an open problem. We contribute to this ongoing discussion by introducing LitEval-Corpus, a paragraph-level parallel corpus containing verified human translations and outputs from 9 MT systems, which totals over 2k translations and 13k evaluated sentences across four language pairs, costing 4.5k €. This corpus enables us to (i) examine the consistency and adequacy of human evaluation schemes with various degrees of complexity, (ii) compare evaluations by students and professionals, and assess the effectiveness of (iii) LLM-based metrics and (iv) LLMs themselves. Our findings indicate that the adequacy of human evaluation is controlled by two factors: the complexity of the evaluation scheme (more complex is less adequate) and the expertise of evaluators (higher expertise yields more adequate evaluations). For instance, MQM (Multidimensional Quality Metrics), a complex scheme and the de facto standard for non-literary human MT evaluation, is largely inadequate for literary translation evaluation: with student evaluators, nearly 60% of human translations are misjudged as indistinguishable from or inferior to machine translations. In contrast, BWS (Best-Worst Scaling), a much simpler scheme, identifies human translations at a rate of 80-100%. Automatic metrics fare dramatically worse, with rates of at most 20%. Our overall evaluation indicates that published human translations consistently outperform LLM translations, where even the most recent LLMs tend to produce considerably more literal and less diverse translations compared to humans.
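To make the identification rates quoted above concrete, the following is a minimal sketch of one way such a rate could be computed. It assumes each paragraph has one score for the human translation and one per MT system under a given scheme, and it counts a human translation as identified only when it scores strictly above every system output, so ties count as indistinguishable. The function name `human_win_rate`, the key `"human"`, the system names, and the toy scores are all hypothetical illustrations, not the paper's actual procedure or data.

```python
from typing import Dict, List


def human_win_rate(scores: List[Dict[str, float]], human_key: str = "human") -> float:
    """Fraction of paragraphs where the human translation scores
    strictly higher than every MT output (assumed definition)."""
    if not scores:
        raise ValueError("need at least one scored paragraph")
    wins = 0
    for paragraph in scores:
        human = paragraph[human_key]
        # Best score achieved by any MT system on this paragraph.
        best_mt = max(v for k, v in paragraph.items() if k != human_key)
        if human > best_mt:  # a tie is treated as "indistinguishable"
            wins += 1
    return wins / len(scores)


# Toy scores for four paragraphs (made-up numbers for illustration only).
example = [
    {"human": 5.0, "sys_a": 4.0, "sys_b": 3.5},  # human preferred
    {"human": 4.0, "sys_a": 4.0, "sys_b": 2.0},  # tie with sys_a
    {"human": 3.0, "sys_a": 4.5, "sys_b": 3.0},  # human rated lower
    {"human": 6.0, "sys_a": 5.0, "sys_b": 5.5},  # human preferred
]
rate = human_win_rate(example)
print(f"human translations identified: {rate:.0%}")
```

Under this assumed definition, a scheme that frequently assigns tied or lower scores to the human translation yields a low rate, which is the kind of behaviour the abstract reports for MQM with student evaluators and for automatic metrics.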
