Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings
Miguel Moura Ramos, Tomás Almeida, Daniel Vareta, Filipe Azevedo, Sweta Agrawal, Patrick Fernandes, André F. T. Martins
TL;DR
The paper addresses reward sparsity in reinforcement learning for neural machine translation by deriving fine-grained, token-level rewards from xCOMET-MQM error spans and their severity labels. It adapts REINFORCE and PPO to operate at the token level, using a tokenization-agnostic mechanism that converts each error span into per-token rewards through a severity mapping, and it uses generalized advantage estimation (GAE) with gamma fixed at 1 for stable optimization. Across EN→DE, EN→FR, and related directions, token-level RL (tRL) improves both neural metric scores and human judgments, outperforming sentence-level RL and BLEU-based rewards, with notable gains for LLM-based MT systems and longer sequences. The results suggest that token-level, severity-aware feedback is a practical and scalable way to improve translation quality and training stability, and that xCOMET, used as a reward model, aligns better with human judgments than traditional lexical metrics.
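To make the span-to-token mapping and the GAE step concrete, here is a minimal sketch rather than the authors' implementation. It assumes error spans arrive as character offsets with a severity label, and that the tokenizer exposes character offsets for each token, which is what keeps the mapping independent of the tokenization scheme. The names `SEVERITY_REWARD`, `token_rewards`, and `gae`, the numeric reward values, the rule of keeping the worst severity when spans overlap, and the lambda default are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical severity-to-reward mapping; the values are illustrative only.
SEVERITY_REWARD: Dict[str, float] = {"minor": -1.0, "major": -5.0, "critical": -10.0}

Offset = Tuple[int, int]      # (char_start, char_end), end exclusive
Span = Tuple[int, int, str]   # (char_start, char_end, severity), end exclusive


def token_rewards(token_offsets: List[Offset],
                  error_spans: List[Span],
                  default: float = 0.0) -> List[float]:
    """Assign each token the reward of the worst error span it overlaps."""
    rewards = []
    for tok_start, tok_end in token_offsets:
        reward = default
        for span_start, span_end, severity in error_spans:
            # Character-level overlap test, so any tokenizer with offsets works.
            if tok_start < span_end and span_start < tok_end:
                # Assumption: overlapping spans resolve to the most severe one.
                reward = min(reward, SEVERITY_REWARD[severity])
        rewards.append(reward)
    return rewards


def gae(rewards: List[float], values: List[float],
        gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Generalized advantage estimation over one sequence (gamma = 1 by default)."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0   # value after the final token is taken to be zero
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Hypothesis "Das ist ein Haus" split into four word-level tokens.
offsets = [(0, 3), (4, 7), (8, 11), (12, 16)]
# A major span over "ein Haus" and a minor span over "Haus".
spans = [(8, 16, "major"), (12, 16, "minor")]

rewards = token_rewards(offsets, spans)
advantages = gae(rewards, values=[0.0] * len(rewards), lam=1.0)
```

With a zero value function and lambda set to 1, each advantage reduces to the sum of the rewards from that token onward. A trained value head and a lambda below 1 would localize the signal further.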
Abstract
Reinforcement learning (RL) has proven to be an effective and robust method for training neural machine translation systems, especially when paired with powerful reward models that accurately assess translation quality. However, most research has focused on RL methods that use sentence-level feedback, which yields inefficient learning signals because of the reward sparsity problem: the model receives a single score for the entire sentence. To address this, we propose a novel approach that leverages fine-grained, token-level quality assessments along with error severity levels using RL methods. Specifically, we use xCOMET, a state-of-the-art quality estimation system, as our token-level reward model. We conduct experiments on small and large translation datasets with standard encoder-decoder and large language model-based machine translation systems, comparing the impact of sentence-level versus fine-grained reward signals on translation quality. Our results show that training with token-level rewards improves translation quality across language pairs over baselines according to both automatic and human evaluation. Furthermore, token-level reward optimization improves training stability, evidenced by a steady increase in mean rewards over training epochs.
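The reward sparsity problem described above can be illustrated with a small sketch. It assumes, as a common convention rather than something stated here, that a sentence-level score is attached to the final token, and it compares the resulting undiscounted returns with those from a hypothetical dense per-token reward vector. The helper names and reward values are invented for illustration.

```python
from typing import List


def sentence_level_rewards(score: float, num_tokens: int) -> List[float]:
    """Sparse signal: a single score, placed on the final token (assumed convention)."""
    return [0.0] * (num_tokens - 1) + [score]


def returns_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Discounted sum of rewards from each position to the end of the sequence."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Four-token hypothesis whose only error sits on the second token.
sparse = sentence_level_rewards(-5.0, num_tokens=4)
dense = [0.0, -5.0, 0.0, 0.0]  # hypothetical token-level rewards

sparse_returns = returns_to_go(sparse)
dense_returns = returns_to_go(dense)
```

Under the sentence-level signal every token receives the same return, so the learner cannot tell which token caused the penalty. Under the token-level signal the tokens after the error carry no penalty, which narrows credit assignment.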
