On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation
Eleftheria Briakou, Zhongtao Liu, Colin Cherry, Markus Freitag
TL;DR
This study analyzes how verbose outputs from large language models affect translation evaluation using WMT 2024 data. It introduces a prompt-based labeling approach to detect three verbosity behaviors (refusal to translate, multiple translations, and added commentary) across eight language pairs and eight models. The results show that verbosity is common and varies by model and language, with safety and non-linguistic content driving refusals and short inputs that lack context driving contextualized explanations. Importantly, how verbose outputs are handled changes both automatic and human evaluation rankings, indicating that current metrics may unfairly penalize or misrank verbose models. This calls for evaluation frameworks that account for contextualized model outputs and refusals.
Abstract
This paper investigates the impact of verbose LLM translations on evaluation. We first demonstrate the prevalence of this behavior across several LLM outputs drawn from the WMT 2024 general shared task on machine translation. We then identify the primary triggers of verbosity, including safety, copyright concerns, and insufficient context in short input queries. Finally, we show that ignoring this behavior unfairly penalizes more verbose LLMs according to both automatic and human evaluations, highlighting the need to address this issue for more accurate future evaluations.
