How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation

Muskaan Chopra, Lorenz Sparrenberg, Sarthak Khanna, Rafet Sifa

TL;DR

This work investigates on-device Critical Error Detection (CED) for English→German MT using compact language models under 2B parameters. It introduces a standardized prompting framework with lightweight logit-bias calibration, majority voting, and merged-weights fine-tuning, and benchmarks models on WMT21, WMT22, and SynCED-EnDe-2025. The key finding is a sweet spot near one billion parameters: Gemma-3-1B achieves strong semantic sensitivity while keeping latency real-time on consumer hardware, whereas the larger Qwen-3-1.7B reaches a higher MCC at a higher compute cost. Ultra-small models struggle with entity and number errors but improve with few-shot calibration, indicating that careful adaptation can close much of the gap to larger evaluators without cloud dependence. The results advance edge-native language safety by showing that trustworthy CED for MT can be achieved with compact, instruction-tuned LLMs, and the authors release prompts, datasets, and scripts for reproducibility and broader community use.

Abstract

Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English→German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
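The calibration, voting, and scoring steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scalar-bias form of the calibration, and the tie-breaking rule are assumptions made for illustration. Only the ERR/NOT label set and the use of MCC come from the text above.

```python
import math
from collections import Counter


def biased_label(logit_err: float, logit_not: float, bias: float = 0.0) -> str:
    """Choose a label after adding a scalar bias to the ERR logit.

    Assumption: calibration is modeled as a single additive offset; the
    paper's exact logit-bias scheme may differ.
    """
    return "ERR" if logit_err + bias > logit_not else "NOT"


def majority_vote(labels: list[str]) -> str:
    """Aggregate several sampled labels for one translation pair.

    Assumption: ties are resolved toward ERR, on the view that flagging
    is the safer default for critical errors.
    """
    counts = Counter(labels)
    return "ERR" if counts["ERR"] >= counts["NOT"] else "NOT"


def mcc(gold: list[str], pred: list[str], positive: str = "ERR") -> float:
    """Matthews correlation coefficient for binary labels."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

A positive bias shifts borderline cases toward ERR, which trades precision for recall on the error class. MCC is a natural headline metric here because it stays informative when ERR and NOT are imbalanced.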

Paper Structure

This paper contains 33 sections, 1 equation, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Overall experimental pipeline for Critical Error Detection (CED). Each $(S_{\text{en}}, T_{\text{de}})$ pair is transformed into a structured prompt, passed through a compact LLM (LFM2-350M, Qwen 3, Gemma 3, or Llama 3.2), calibrated via logit bias and majority voting, and logged for evaluation metrics.
  • Figure 2: MCC as a function of model size (average across WMT21, WMT22, and SynCED-EnDe-2025). Performance rises steeply up to $\sim$1B parameters and then saturates, indicating diminishing returns beyond the edge-deployable range.
  • Figure 3: Dataset-wise MCC across compact models. Gemma-3-1B consistently outperforms other models across WMT21, WMT22, and SynCED-EnDe-2025, with the largest relative improvement on the balanced SynCED corpus.
  • Figure 4: Latency-accuracy trade-off for compact LLMs (average MCC across datasets). Gemma-3-1B lies on the Pareto frontier, combining high semantic accuracy (MCC = 0.48) with sub-400 ms inference latency on an Apple M4 Pro (24 GB).
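The Pareto-frontier notion used in the Figure 4 caption can be made concrete with a short sketch. The function name and the tuple layout are hypothetical, and any measurements passed in must come from one's own runs; nothing here reproduces the paper's data.

```python
def pareto_frontier(points: list[tuple[str, float, float]]) -> list[str]:
    """Return names of models not dominated on (latency, MCC).

    Each point is (name, latency_ms, mcc). A model is dominated if some
    other model is at least as fast and at least as accurate, and
    strictly better on at least one of the two.
    """
    frontier = []
    for name, latency, score in points:
        dominated = any(
            other_latency <= latency
            and other_score >= score
            and (other_latency < latency or other_score > score)
            for _, other_latency, other_score in points
        )
        if not dominated:
            frontier.append(name)
    return frontier
```

A model on the frontier cannot be replaced by another that is both faster and more accurate, which is the sense in which the caption describes Gemma-3-1B.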