Context-Aware or Context-Insensitive? Assessing LLMs' Performance in Document-Level Translation
Wafaa Mohammed, Vlad Niculae
TL;DR
The paper evaluates whether LLMs actually use document context in document-level translation, and how robust they are when that context is perturbed, through perturbation and attribution analyses across diverse LLMs and encoder-decoder baselines. It finds that translation-finetuned LLMs can outperform encoder-decoder models in overall translation, but their pronoun-resolution performance lags, and they are robust to randomized context, which indicates limited context utilization. Attribution analyses reveal low reliance on relevant antecedents, underscoring the need for explicit context-aware fine-tuning and better datasets for discourse phenomena. The work provides fine-grained diagnostics of context usage in translation LLMs and highlights practical implications for deploying document-level translation systems.
Abstract
Large language models (LLMs) are increasingly strong contenders in machine translation. In this work, we focus on document-level translation, where some words cannot be translated without context from outside the sentence. Specifically, we investigate the ability of prominent LLMs to utilize the document context during translation through a perturbation analysis (analyzing models' robustness to perturbed and randomized document context) and an attribution analysis (examining the contribution of relevant context to the translation). We conduct an extensive evaluation across nine LLMs from diverse model families and training paradigms, including translation-specialized LLMs, alongside two encoder-decoder transformer baselines. We find that LLMs' improved document-translation performance compared to encoder-decoder models is not reflected in pronoun translation performance. Our analysis highlights the need for context-aware finetuning of LLMs with a focus on relevant parts of the context to improve their reliability for document-level translation.
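To make the perturbation analysis concrete, the following is a minimal sketch of how gold and randomized contexts could be paired for a single source sentence. It is an illustration under stated assumptions, not the authors' implementation: the function names (`gold_context`, `random_context`, `build_inputs`), the context window `k`, and the choice to draw random context from other documents are all hypothetical.

```python
import random
from typing import Dict, List, Sequence


def gold_context(doc: Sequence[str], idx: int, k: int) -> List[str]:
    """Return up to k sentences that precede sentence idx in the same document."""
    return list(doc[max(0, idx - k):idx])


def random_context(docs: Sequence[Sequence[str]], doc_id: int, n: int,
                   rng: random.Random) -> List[str]:
    """Sample n sentences from documents other than doc_id (assumed perturbation)."""
    pool = [s for i, d in enumerate(docs) if i != doc_id for s in d]
    return rng.sample(pool, min(n, len(pool)))


def build_inputs(docs: Sequence[Sequence[str]], doc_id: int, idx: int,
                 k: int = 2, seed: int = 0) -> Dict[str, object]:
    """Pair one source sentence with its gold context and a size-matched random one."""
    rng = random.Random(seed)
    gold = gold_context(docs[doc_id], idx, k)
    rand = random_context(docs, doc_id, len(gold), rng)
    return {"source": docs[doc_id][idx], "gold": gold, "random": rand}


docs = [
    ["The cat sat.", "It purred.", "It slept."],
    ["Stocks fell.", "Bonds rose.", "Gold held."],
]
example = build_inputs(docs, doc_id=0, idx=2, k=2)
```

Translating `example["source"]` once with the gold context and once with the random context, then comparing the two outputs (for instance on pronoun accuracy), gives a rough measure of how much a model depends on the context it is given. A model whose translations barely change between the two conditions is likely making limited use of the context.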
