When the Gold Standard isn't Necessarily Standard: Challenges of Evaluating the Translation of User-Generated Content

Lydia Nishimwe, Benoît Sagot, Rachel Bawden

TL;DR

The paper investigates the challenge of evaluating translations of user-generated content (UGC) by analyzing four UGC datasets to extract translation guidelines and a taxonomy of 12 non-standard phenomena plus 5 translation actions. It demonstrates that automatic evaluation scores are highly sensitive to prompting and the underlying guideline style, showing that guideline-aligned prompts can significantly shift COMET-based metrics. By comparing multiple models, including NLLB-3B and several instruction-tuned LLMs, the study reveals a spectrum of responsiveness to guidelines and highlights the risk of mismatched guidelines across datasets. The authors argue for guideline-aware evaluation frameworks and clearer dataset creation guidelines, proposing practical recommendations such as multi-version references or LLM-based judges configured with dataset-specific guidelines to enable fairer UGC translation evaluation.

Abstract

User-generated content (UGC) is characterised by frequent use of non-standard language, from spelling errors to expressive choices such as slang, character repetitions, and emojis. This makes evaluating UGC translation particularly challenging: what counts as a "good" translation depends on the level of standardness desired in the output. To explore this, we examine the human translation guidelines of four UGC datasets, and derive a taxonomy of twelve non-standard phenomena and five translation actions (NORMALISE, COPY, TRANSFER, OMIT, CENSOR). Our analysis reveals notable differences in how UGC is treated, resulting in a spectrum of standardness in reference translations. Through a case study on large language models (LLMs), we show that translation scores are highly sensitive to prompts with explicit translation instructions for UGC, and that they improve when these align with the dataset's guidelines. We argue that when preserving UGC style is important, fair evaluation requires both models and metrics to be aware of translation guidelines. Finally, we call for clear guidelines during dataset creation and for the development of controllable, guideline-aware evaluation frameworks for UGC translation.

Paper Structure

This paper contains 44 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Example of non-standard phenomena in English translated into French with specific actions. The grammatical error is corrected (Normalise), the irregular capitalisation and word elongation are translated into their equivalents in French (Transfer), and the repeated punctuation is copied (Copy).
  • Figure 2: COMET and COMET-Kiwi scores for translating UGC with and without corpus-specific guidelines.
  • Figure 3: Percentage of translation requests refused by the LLaMA model (prompted with corpus-specific guidelines) due to its internal self-censorship guidelines.
  • Figure 4: BLEU scores for translating UGC with and without corpus-specific guidelines.
  • Figure 5: Lexical overlap, measured in BLEU scores, between LLM translation outputs across all guidelines and for each dataset.
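
The abstract's five translation actions and the Figure 1 example can be illustrated with a small sketch of how a guideline-aware prompt might be assembled. This is not the authors' prompt or code: the instruction wording, the `build_prompt` helper, and the `FIGURE_1_GUIDELINES` mapping are hypothetical, and only the action names and the four phenomena shown come from the text above.

```python
from enum import Enum


class Action(Enum):
    """The five translation actions named in the abstract."""
    NORMALISE = "normalise"
    COPY = "copy"
    TRANSFER = "transfer"
    OMIT = "omit"
    CENSOR = "censor"


# Hypothetical instruction wording for each action (an assumption,
# not the phrasing used in the paper's prompts).
ACTION_TEMPLATES = {
    Action.NORMALISE: "rewrite in standard {tgt}",
    Action.COPY: "copy unchanged into the translation",
    Action.TRANSFER: "render with an equivalent non-standard form in {tgt}",
    Action.OMIT: "leave out of the translation",
    Action.CENSOR: "mask in the translation",
}

# Phenomenon-to-action mapping taken from the Figure 1 caption.
FIGURE_1_GUIDELINES = {
    "grammatical errors": Action.NORMALISE,
    "irregular capitalisation": Action.TRANSFER,
    "word elongation": Action.TRANSFER,
    "repeated punctuation": Action.COPY,
}


def build_prompt(source, guidelines, src_lang="English", tgt_lang="French"):
    """Assemble a translation prompt with optional per-phenomenon guidelines."""
    lines = [f"Translate the following text from {src_lang} to {tgt_lang}."]
    if guidelines:
        lines.append("Follow these guidelines:")
        for phenomenon, action in guidelines.items():
            instruction = ACTION_TEMPLATES[action].format(tgt=tgt_lang)
            lines.append(f"- {phenomenon}: {instruction}")
    lines.append(f"Text: {source}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("this movie are sooo GOOD!!!", FIGURE_1_GUIDELINES))
```

Passing an empty mapping yields a plain translation prompt, which gives a simple way to compare the "with" and "without guidelines" conditions on the same input.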