How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation
Muskaan Chopra, Lorenz Sparrenberg, Sarthak Khanna, Rafet Sifa
TL;DR
This work investigates on-device Critical Error Detection (CED) for English→German machine translation (MT) using compact language models under 2B parameters. It introduces a standardized prompting framework with lightweight logit-bias calibration, majority voting, and merged-weights fine-tuning, and benchmarks models across WMT21, WMT22, and SynCED-EnDe-2025. The key finding is a sweet spot near one billion parameters: Gemma-3-1B achieves strong semantic sensitivity while maintaining real-time latency on consumer hardware, whereas the larger Qwen-3-1.7B reaches a higher MCC at higher compute cost. Ultra-small models struggle with entity and number errors but improve with few-shot calibration, indicating that careful adaptation can close much of the gap to larger evaluators without cloud dependence. The results advance edge-native language safety by showing that trustworthy multilingual CED can be achieved with compact, instruction-tuned LLMs. The authors release open prompts, datasets, and scripts for reproducibility and broader community use.
Abstract
Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you go while still detecting meaning-altering translation errors? Focusing on English→German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining a single-sample latency of 400 ms on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma-3-1B) but at higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
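
To make the evaluation pipeline named above concrete, the following is a minimal sketch of how logit-bias calibration, majority voting, and the reported metrics (MCC, F1-ERR, F1-NOT) could fit together for a binary ERR/NOT decision. The abstract does not specify the exact procedure, so the function names, the single additive bias on the ERR logit, and the tie-breaking rule are all illustrative assumptions and not the authors' implementation.

```python
import math
from collections import Counter


def calibrated_label(logit_err, logit_not, bias=0.0):
    """Pick a label after adding a scalar bias to the ERR logit.

    Assumption: calibration is a single additive offset; the paper's
    actual scheme may differ.
    """
    return "ERR" if logit_err + bias > logit_not else "NOT"


def majority_vote(labels, tie_break="ERR"):
    """Aggregate several ERR/NOT predictions for one sample.

    Assumption: ties fall back to ERR (the cautious choice for screening).
    """
    counts = Counter(labels)
    if counts["ERR"] == counts["NOT"]:
        return tie_break
    return "ERR" if counts["ERR"] > counts["NOT"] else "NOT"


def confusion(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn) treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1(y_true, y_pred, label):
    """Per-class F1; F1-ERR uses label='ERR', F1-NOT uses label='NOT'."""
    tp, fp, fn, _ = confusion(y_true, y_pred, label)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def mcc(y_true, y_pred, positive="ERR"):
    """Matthews correlation coefficient for the binary ERR/NOT task."""
    tp, fp, fn, tn = confusion(y_true, y_pred, positive)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

In this sketch, a positive `bias` shifts borderline cases toward ERR, which trades precision for recall on critical errors. MCC is a natural headline metric for this task because, unlike accuracy, it stays informative when the ERR and NOT classes are imbalanced.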
