On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation

Eleftheria Briakou, Zhongtao Liu, Colin Cherry, Markus Freitag

TL;DR

This study analyzes how verbose outputs from large language models affect translation evaluation, using data from the WMT 2024 general shared task. It introduces a prompt-based labeling approach to detect three verbosity behaviors (refusal to translate, multiple translations, and added commentary) across eight language pairs and eight models. The results show that verbosity is common and varies by model and language: safety concerns and non-linguistic content drive refusals, while short input queries that lack context drive contextualized explanations. Importantly, discarding verbose outputs changes both automatic and human evaluation rankings, which indicates that current evaluation practice may unfairly penalize or misrank verbose models. This calls for evaluation frameworks that account for contextualized model outputs and refusals.

Abstract

This paper investigates the impact of verbose LLM translations on evaluation. We first demonstrate the prevalence of this behavior across several LLM outputs drawn from the WMT 2024 general shared task on machine translation. We then identify the primary triggers of verbosity, including safety, copyright concerns, and insufficient context in short input queries. Finally, we show that ignoring this behavior unfairly penalizes more verbose LLMs according to both automatic and human evaluations, highlighting the need to address this issue for more accurate future evaluations.

Paper Structure

This paper contains 16 sections, 6 figures, and 7 tables.

Figures (6)

  • Figure 1: Verbosity in LLM translation responses.
  • Figure 2: Number of translation outputs detected as (a) verbose, along with heatmaps showing how verbosity is distributed across (b) refusal-to-translate and (c) commentary cases.
  • Figure 3: Distribution of MQM error categories for verbose translations across domains, in German.
  • Figure 4: Distribution of translation refusals across sub-classes and domains.
  • Figure 5: Distribution of commentary behavior across sub-classes and domains.
  • ...and 1 more figure