Large language models effectively leverage document-level context for literary translation, but critical errors persist

Marzena Karpinska; Mohit Iyyer

TL;DR

This study investigates whether large language models can leverage document-level context to improve literary translation. By comparing sentence-level, paragraph-contextualized sentence-level, and full-paragraph prompts across 18 language pairs, the authors conduct a rigorous human evaluation on 360 aligned paragraphs and accompany it with automatic metrics. They show that translating entire paragraphs (Para) generally yields higher-quality translations with better coherence and style preservation, though notable omissions and context-sensitive errors persist, underscoring the continued need for human oversight. The work contributes a large, publicly released dataset with fine-grained error annotations and demonstrates the promise and limits of LLM-based document-level literary translation, pointing to future work on integrating paragraph translations into cohesive chapters and novels.
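
The three prompt settings compared above (sentence-level, paragraph-contextualized sentence-level, and full-paragraph) differ mainly in how much source text each request to the model contains. The following is a minimal sketch of that difference, not the authors' actual prompts: the function name `build_prompts`, the setting labels, and the template wording are all hypothetical, and the paragraph is assumed to be already split into sentences.

```python
from typing import List


def build_prompts(sentences: List[str], src: str, tgt: str, setting: str) -> List[str]:
    """Return the list of model inputs for one source paragraph.

    The template wording is a hypothetical placeholder, not the prompt
    used in the paper.
    """
    # Naive re-joining with spaces; languages written without spaces
    # (e.g., Japanese) would need a different joiner.
    paragraph = " ".join(sentences)

    if setting == "sent":
        # One request per sentence, with no surrounding context.
        return [f"Translate this {src} sentence into {tgt}:\n{s}" for s in sentences]

    if setting == "para_sent":
        # One request per sentence, but the whole paragraph is shown as context.
        return [
            f"{src} paragraph (context only):\n{paragraph}\n\n"
            f"Translate only this sentence into {tgt}:\n{s}"
            for s in sentences
        ]

    if setting == "para":
        # A single request that covers the entire paragraph.
        return [f"Translate this {src} paragraph into {tgt}:\n{paragraph}"]

    raise ValueError(f"unknown setting: {setting!r}")
```

Under these assumptions, a three-sentence paragraph produces three requests in the first two settings and a single request in the full-paragraph setting, which is the only one where the model writes the whole target paragraph in one pass.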

Abstract

Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets. However, their ability to translate paragraphs and documents remains unexplored because evaluation in these settings is costly and difficult. We show through a rigorous human evaluation that asking the Gpt-3.5 (text-davinci-003) LLM to translate an entire literary paragraph (e.g., from a novel) at once results in higher-quality translations than standard sentence-by-sentence translation across 18 linguistically-diverse language pairs (e.g., translating into and out of Japanese, Polish, and English). Our evaluation, which took approximately 350 hours of effort for annotation and analysis, is conducted by hiring translators fluent in both the source and target language and asking them to provide both span-level error annotations as well as preference judgments of which system's translations are better. We observe that discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. With that said, critical errors still abound, including occasional content omissions, and a human translator's intervention remains necessary to ensure that the author's voice remains intact. We publicly release our dataset and error annotations to spur future research on evaluation of document-level literary translation.

Paper Structure (59 sections, 1 equation, 11 figures, 23 tables)

Introduction
Why literary texts?
Why human evaluation?
How do we use LLMs to translate paragraphs?
LLMs produce better translations when provided with paragraph-level context:
Background
Existing approaches to document-level translation
Translation with large language models
Data & methods
Dataset collection
Selecting paragraphs from novels:
Paragraph length:
Target language selection:
Source language selection:
Source language selection:
...and 44 more sections

Figures (11)

Figure 1: A plot of the total number of errors annotated in sentence-level (Sent) and paragraph-level (Para) translations produced by Gpt-3.5 across 18 different language pairs. In all cases, Para produces fewer errors than Sent, which demonstrates that Gpt-3.5 takes advantage of discourse context during translation.
Figure 2: An example of paragraph-level (Para) and sentence-level (Sent) translations of the same Japanese paragraph into English. Sentence-level translation results in a range of erroneous translations, from worse word choice ("understood" vs "right away") to incorrect pronouns ("he" vs "I").
Figure 3: A description of the annotation process for a pair of candidate translations given a source paragraph. Note that our hired translators go through this pipeline for three different pairs per source paragraph, comparing Para with Sent, Para_Sent, and GTr.
Figure 4: The distribution of translator preference judgments between sentence-level translation (Sent) and paragraph-level translation (Para). Para is preferred (i.e., more votes) in every language pair except de-ja, fr-en and de-en.
Figure 5: The number of votes for Sent vs Para, Para_Sent vs Para, and GTr vs Para along with rater confidence (confident or unsure). Para is preferred to all competing methods. All differences are statistically significant at p<.001 (binomial test).
...and 6 more figures
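
The Figure 5 caption reports significance from a binomial test on preference votes. As a rough illustration of what such a test computes, the sketch below implements an exact two-sided binomial test against a null hypothesis of no preference (each vote equally likely to go either way). The vote counts are hypothetical placeholders, not figures from the paper, and the helper name `binomial_two_sided_p` is an assumption.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test p-value under null success rate p."""

    def pmf(k: int) -> float:
        return comb(trials, k) * p**k * (1 - p) ** (trials - k)

    observed = pmf(successes)
    # Sum the probability of every outcome that is no more likely than the
    # observed one; the small tolerance guards against float rounding.
    total = sum(
        pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical vote counts for illustration only (not taken from the paper).
para_votes, sent_votes = 70, 30
p_value = binomial_two_sided_p(para_votes, para_votes + sent_votes)
print(f"two-sided p-value: {p_value:.2e}")
```

A p-value below the chosen threshold (the caption uses p<.001) indicates that a split this lopsided would be unlikely if raters had no real preference between the two systems.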
