Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety
Muskaan Chopra, Lorenz Sparrenberg, Rafet Sifa
TL;DR
This work tackles critical error detection (CED) in English-to-German machine translation by evaluating instruction-tuned large language models (LLMs) across model scales and adaptation regimes. It introduces a cross-family scaling study that compares encoder-only baselines with decoder LLMs under zero-shot, few-shot, prompt-tuning, and LoRA fine-tuning on WMT-21, WMT-22, and SynCED-EnDe 2025, using the Matthews correlation coefficient (MCC) as the primary metric. Key findings show that instruction alignment and model scaling significantly improve detection of meaning-critical errors, with fine-tuned mid-sized models (e.g., GPT-4o-mini, LLaMA-3.1-8B) achieving near-saturation MCC and static gains from committee voting. The work argues for the societal relevance of reliable CED as a safeguard for trustworthy multilingual AI under the IR-for-Good paradigm, offering a practical path toward safer, more accountable cross-lingual information systems. Future directions include document-level CED, confidence calibration, and broader language coverage for real-world deployment.
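For readers unfamiliar with the metric, the following is a minimal sketch of how MCC could be computed for binary critical-error labels, together with one possible form of committee voting. The label convention (1 = critical error, 0 = no critical error), the function names, and the simple majority rule are assumptions made for illustration; the summary above does not specify how the committee combines its votes.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels.

    Assumed convention: 1 = critical error, 0 = no critical error.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0

def majority_vote(votes_per_model):
    """Combine per-model binary predictions by strict majority.

    Hypothetical voting rule; ties (even committees) resolve to 0.
    """
    n_models = len(votes_per_model)
    return [1 if 2 * sum(col) > n_models else 0 for col in zip(*votes_per_model)]

if __name__ == "__main__":
    gold = [1, 1, 0, 0]
    committee = [[1, 0, 0, 0], [1, 1, 0, 1], [0, 1, 0, 0]]
    combined = majority_vote(committee)
    print(combined, round(mcc(gold, combined), 3))
```

MCC uses all four cells of the confusion matrix, which makes it more informative than accuracy when critical errors are a small minority of the examples.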
Abstract
Machine Translation (MT) plays a pivotal role in cross-lingual information access, public policy communication, and equitable knowledge dissemination. However, critical meaning errors, such as factual distortions, intent reversals, or biased translations, can undermine the reliability, fairness, and safety of multilingual systems. In this work, we explore the capacity of instruction-tuned Large Language Models (LLMs) to detect such critical errors, evaluating models across a range of parameter scales on publicly accessible datasets. Our findings show that model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent improvements, outperforming encoder-only baselines such as XLM-R and ModernBERT. We argue that improving critical error detection in MT contributes to safer, more trustworthy, and socially accountable information systems by reducing the risk of disinformation, miscommunication, and linguistic harm, especially in high-stakes or underrepresented contexts. This work positions error detection not merely as a technical challenge, but as a necessary safeguard in the pursuit of just and responsible multilingual AI. The code will be made available on GitHub.
