Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation

Archchana Sindhujan, Girish A. Koushik, Shenbin Qian, Diptesh Kanojia, Constantin Orăsan

TL;DR

This work tackles the challenge of reference-free MT quality estimation by moving beyond scalar Direct Assessment scores to error-aware, contextual reasoning using Translation Quality Remarks (TQR). It introduces a segment-level English→Malayalam QE dataset (En→Ml) with DA and TQR, and proposes ALOPE-RL, a policy-based reinforcement learning framework built on GRPO that optimizes multiple rewards, including error categories and natural-language explanations, while fine-tuning compact LLMs with LoRA and 4-bit quantization. The approach yields state-of-the-art QE performance on En→Ml with only ~4K labeled examples and demonstrates that TQR provides a stronger, scalable weak supervision signal than token-level cues across several language pairs. The results underscore the value of error-aware, policy-optimized learning for QE under limited data and compute budgets, and the authors release datasets, code, and trained models to spur future research.

Abstract

Quality Estimation (QE) aims to assess the quality of machine translation (MT) outputs without relying on reference translations, making it essential for real-world, large-scale MT evaluation. Large Language Models (LLMs) have shown significant promise in advancing the field of quality estimation of machine translation. However, most QE approaches rely solely on scalar quality scores, offering no explicit information about the translation errors that should drive these judgments. Moreover, for low-resource languages where annotated QE data is limited, existing approaches struggle to achieve reliable performance. To address these challenges, we introduce the first segment-level QE dataset for English to Malayalam, a severely resource-scarce language pair in the QE domain, comprising human-annotated Direct Assessment (DA) scores and Translation Quality Remarks (TQR), which are short, contextual, free-form annotator comments that describe translation errors. We further introduce ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters based on policy rewards derived from DA scores and TQR. Integrating error-aware rewards with ALOPE-RL enables LLMs to reason about translation quality beyond numeric scores. Despite being trained on a small-scale QE dataset, ALOPE-RL achieves state-of-the-art performance on English to Malayalam QE using compact LLMs (≤4B parameters) fine-tuned with LoRA and 4-bit quantization, outperforming both larger LLM-based baselines and leading encoder-based QE models. Our results demonstrate that error-aware, policy-based learning can deliver strong QE performance under limited data and compute budgets. We release our dataset, code, and trained models to support future research.
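
To make the idea of "policy rewards derived from DA score" more concrete, the following is a minimal sketch of how a score-based reward could feed a GRPO-style group-relative advantage. It is not the authors' implementation: the function names (`parse_score`, `da_reward`, `group_relative_advantages`), the 0-100 DA scale, the linear reward shape, and the use of the population standard deviation are all assumptions made for illustration. The paper's actual reward set, including the TQR-based rewards, is not reproduced here.

```python
import re
import statistics
from typing import List, Optional


def parse_score(completion: str) -> Optional[float]:
    """Return the first number found in a model completion, or None."""
    match = re.search(r"-?\d+(?:\.\d+)?", completion)
    return float(match.group()) if match else None


def da_reward(completion: str, gold_da: float, scale: float = 100.0) -> float:
    """Hypothetical reward: 1 minus the normalised absolute error.

    Assumes DA scores lie on a 0-100 scale; returns 0.0 when the
    completion contains no parsable score.
    """
    pred = parse_score(completion)
    if pred is None:
        return 0.0
    return max(0.0, 1.0 - abs(pred - gold_da) / scale)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardise rewards within one group of sampled completions.

    This is the group-relative baseline commonly associated with GRPO:
    each completion is scored against the mean of its own group.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Three hypothetical completions sampled for one source/translation pair.
completions = ["Score: 70", "Score: 40", "no score given"]
rewards = [da_reward(c, gold_da=70.0) for c in completions]
advantages = group_relative_advantages(rewards)
```

In this sketch the completion that matches the gold score receives the largest advantage and the completion with no parsable score receives the smallest, so a policy update would shift probability toward well-formed, accurate predictions. Additional rewards, such as ones based on error categories or explanations, could be summed into `rewards` before standardisation.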

Paper Structure (25 sections, 3 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: Example of the human-annotated TQR and the generated synthetic explanations.
  • Figure 2: Prompt template for synthetic data generation with Translation Quality Remarks.
  • Figure 3: Prompt template used for all ALOPE-RL experiments when human-annotated TQR is used as the weak supervision signal.
  • Figure 4: Architecture diagram of ALOPE-RL.
  • Figure 5: Performance comparison on En→Ml across different weak-signal settings. IFT = Instruction Fine-Tuning; CR = Core rewards; AR = All rewards.
  • ...and 3 more figures