Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content?

Shenbin Qian, Constantin Orăsan, Diptesh Kanojia, Félix do Carmo

TL;DR

The paper investigates whether large language models can serve as state-of-the-art, reference-less quality estimators for MT of emotion-loaded user-generated content by leveraging MQM-based scores. It contrasts in-context learning and parameter-efficient fine-tuning (using LoRA on 4-bit LLMs) against traditional fine-tuning baselines, using HADQAET as emotion-rich data and MQM-derived ground truth. Key findings show that PEFT-tuned LLMs can surpass the baselines in score-prediction accuracy while producing human-readable explanations, though practical issues, such as refusals to respond and output instability, complicate their deployment for QE. The study highlights the importance of emotion-preserving evaluation, proposes novel prompts, and outlines practical considerations for applying LLM-based QE to emotion-rich MT tasks. Overall, PEFT-enabled LLMs offer a promising, interpretable QE pathway, but safety, stability, and resource-cost concerns remain areas for future work.

Abstract

This paper investigates whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC) that contains emotional expressions, without the use of reference translations. To achieve this, we employ an existing emotion-related dataset with human-annotated errors and calculate quality evaluation scores based on the Multi-dimensional Quality Metrics. We compare the accuracy of several LLMs with that of our fine-tuned baseline models, under in-context learning and parameter-efficient fine-tuning (PEFT) scenarios. We find that PEFT of LLMs leads to better performance in score prediction with human interpretable explanations than fine-tuned models. However, a manual analysis of LLM outputs reveals that they still have problems such as refusal to reply to a prompt and unstable output while evaluating machine translation of UGC.
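
As a rough illustration of how human-annotated errors can be turned into an MQM-style quality score, the sketch below sums a penalty per error severity and normalises by segment length. The severity weights, the function names and the normalisation by token count are assumptions made for this example; the abstract does not state the exact formula used in the paper.

```python
# Illustrative severity weights: an assumption for this sketch,
# not values taken from the paper.
SEVERITY_WEIGHTS = {"minor": 1, "major": 5, "critical": 10}


def mqm_penalty(severities, weights=SEVERITY_WEIGHTS):
    """Sum the weighted penalties for a list of annotated error severities."""
    return sum(weights[s] for s in severities)


def mqm_score(severities, num_tokens, weights=SEVERITY_WEIGHTS):
    """Length-normalised score: 1.0 means no errors, lower means worse."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return 1.0 - mqm_penalty(severities, weights) / num_tokens


if __name__ == "__main__":
    # A hypothetical 12-token segment with one minor and one major error.
    print(mqm_score(["minor", "major"], num_tokens=12))
```

A score computed this way can serve as the regression target that a quality-estimation model is trained or prompted to predict; other normalisations (for example, per 100 tokens) are equally plausible.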

Paper Structure (20 sections, 7 figures, 4 tables)

This paper contains 20 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Example of translations from Google Translate and ChatGPT
  • Figure 2: Prompt Template 1
  • Figure 3: Prompt Template 2
  • Figure 4: An example of refusal to reply because of interjections
  • Figure 5: An example of refusal to reply because of "sensitive" words
  • ...and 2 more figures