Investigating Length Issues in Document-level Machine Translation
Ziqian Peng, Rachel Bawden, François Yvon
TL;DR
The paper investigates length-related degradation in document-level machine translation using a controlled methodology that varies input length and sentence position, and evaluates two architectures in a doc-to-doc translation setting. It introduces ds-BLEU and a paired testing framework to quantify length and position effects, and proposes unifPE to balance training exposure to positional encodings. Key findings show that translation quality generally declines as document length grows, and that earlier sentences are translated more reliably than later ones. Rebalancing the distribution of document lengths or the exposure to positional encodings yields model-dependent improvements, notably for encoder-decoder NLLB-based systems, while the RoPE-based TowerBase remains less responsive. The work reinforces that sentence-level MT remains a strong baseline and highlights avenues for improving document-level MT, such as attention constraints that mimic sentence alignment or better memorization strategies, with broader implications for evaluation and training practices in long-context NLP.
Abstract
Transformer architectures are increasingly effective at processing and generating very long chunks of text, opening new perspectives for document-level machine translation (MT). In this work, we challenge the ability of MT systems to handle texts comprising up to several thousands of tokens. We design and implement a new approach to precisely measure the effect of length increments on MT outputs. Our experiments with two representative architectures unambiguously show that (a) translation performance decreases with the length of the input text; (b) the position of sentences within the document matters, and translation quality is higher for sentences occurring earlier in a document. We further show that manipulating the distribution of document lengths and of positional embeddings only marginally mitigates such problems. Our results suggest that even though document-level MT is computationally feasible, it does not yet match the performance of sentence-based MT.
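To illustrate what a paired comparison across input lengths can look like, here is a minimal sketch of a generic paired sign-flip permutation test. It is not the paper's ds-BLEU or its actual testing framework, whose definitions are not given here. It assumes that the same source sentences have been translated once inside short documents and once inside long ones, and that each sentence has a quality score in both conditions. The function name `paired_permutation_test` and the toy scores are hypothetical.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired sign-flip permutation test on the mean difference.

    scores_a[i] and scores_b[i] must be scores for the SAME sentence under
    two conditions (for example, short versus long input documents).
    Returns (observed mean difference, estimated p-value).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    if not scores_a:
        raise ValueError("at least one pair of scores is required")

    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary,
        # so flip each sign independently with probability 0.5.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(resampled) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_resamples + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy placeholder scores, NOT results from the paper.
    short_ctx = [0.62, 0.58, 0.71, 0.66, 0.60, 0.69, 0.64, 0.73]
    long_ctx = [0.55, 0.57, 0.63, 0.60, 0.52, 0.65, 0.61, 0.66]
    mean_diff, p = paired_permutation_test(short_ctx, long_ctx)
    print(f"mean difference = {mean_diff:.3f}, p = {p:.4f}")
```

Pairing on the sentence removes variation due to sentence difficulty, so any remaining difference can be attributed more directly to the length or position condition. The same function could be applied to scores grouped by sentence position rather than by document length.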
