Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety

Muskaan Chopra, Lorenz Sparrenberg, Rafet Sifa

TL;DR

This work tackles critical error detection in English-to-German machine translation by evaluating instruction-tuned large language models (LLMs) across model scales and adaptation regimes. It introduces a Cross-Family Scaling study that compares encoder-only baselines with decoder LLMs, examining zero-shot, few-shot, prompt-tuning, and LoRA-fine-tuning approaches on WMT-21, WMT-22, and SynCED-EnDe 2025, using MCC as the primary metric. Key findings show that instruction alignment and model scaling significantly improve detection of meaning-critical errors, with fine-tuned mid-sized models (e.g., GPT-4o-mini, LLaMA-3.1-8B) achieving near-saturation MCC and static gains from committee voting. The work argues for the societal relevance of reliable CED as a safeguard for trustworthy multilingual AI under the IR-for-Good paradigm, offering a practical path toward safer, more accountable cross-lingual information systems. Future directions include document-level CED, confidence calibration, and broader language coverage for real-world deployment.
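Since MCC is the primary metric here, a minimal sketch of how it is computed for binary CED labels may help. The function names `mcc` and `majority_vote` are hypothetical, the label convention (1 = critical error, 0 = no critical error) is an assumption, and the simple majority rule is only one plausible reading of "committee voting", which the summary does not specify.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (1 = critical error)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def majority_vote(votes):
    """Combine per-model label lists by strict majority (assumed committee rule)."""
    return [1 if 2 * sum(col) > len(col) else 0 for col in zip(*votes)]

# Toy labels for illustration only; these are not data from the paper.
gold = [1, 0, 1, 0]
committee = majority_vote([[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 0]])
score = mcc(gold, committee)
```

MCC ranges from -1 to 1 and uses all four confusion-matrix cells, which makes it a common choice when the error class is rare.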

Abstract

Machine Translation (MT) plays a pivotal role in cross-lingual information access, public policy communication, and equitable knowledge dissemination. However, critical meaning errors, such as factual distortions, intent reversals, or biased translations, can undermine the reliability, fairness, and safety of multilingual systems. In this work, we explore the capacity of instruction-tuned Large Language Models (LLMs) to detect such critical errors, evaluating models across a range of parameter scales on publicly accessible datasets. Our findings show that model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent improvements, outperforming encoder-only baselines such as XLM-R and ModernBERT. We argue that improving critical error detection in MT contributes to safer, more trustworthy, and socially accountable information systems by reducing the risk of disinformation, miscommunication, and linguistic harm, especially in high-stakes or underrepresented contexts. This work positions error detection not merely as a technical challenge, but as a necessary safeguard in the pursuit of just and responsible multilingual AI. The code will be made available on GitHub.

Paper Structure (25 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Conceptual overview of the proposed Critical Error Detection (CED) framework. The pipeline progresses from task definition and dataset sources to diverse model families (encoder-only models and decoder LLMs), followed by adaptation regimes and evaluation. Together, these components support safer multilingual information access through meaning-preserving translation.
  • Figure 2: Impact of model scaling and adaptation on MCC across datasets (WMT-21, WMT-22, SynCED-EnDe 2025). Encoder-only models (gray) form the lower baseline; instruction-aligned LLMs are shown in red (few-shot) and blue (fine-tuned).