Fine-Grained Reward Optimization for Machine Translation using Error Severity Mappings

Miguel Moura Ramos, Tomás Almeida, Daniel Vareta, Filipe Azevedo, Sweta Agrawal, Patrick Fernandes, André F. T. Martins

TL;DR

The paper tackles reward sparsity in neural machine translation by introducing fine-grained, token-level rewards derived from the error spans and severity labels predicted by xCOMET-MQM. It adapts REINFORCE and PPO to operate at the token level, using a tokenization-agnostic mechanism that converts error spans into per-token rewards through a severity mapping, and uses Generalized Advantage Estimation (GAE) with the discount factor gamma fixed at 1 for stable optimization. Empirical results across EN→DE, EN→FR, and related directions show that token-level RL (tRL) improves neural metric scores and human judgments, with notable gains for LLM-based MT systems and longer sequences, and that it outperforms both sentence-level RL (sRL) and BLEU-based rewards. The findings present token-level, severity-aware feedback as a practical and scalable way to improve translation quality and training stability in MT systems. They also indicate that xCOMET, used as a reward model, aligns better with human judgments than traditional lexical metrics.
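The span-to-token mapping and the gamma = 1 advantage computation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the severity values in `SEVERITY_REWARD`, the helper names, the `lam` default, and the assumption that each token appears verbatim in the translation string (so character offsets can be recovered by a left-to-right search) are all hypothetical. A real tokenizer with special prefix markers would need its own offset mapping.

```python
from typing import Dict, List, Tuple

# Hypothetical severity-to-reward mapping; the values are illustrative only
# and are NOT taken from the paper.
SEVERITY_REWARD: Dict[str, float] = {
    "none": 1.0, "minor": -1.0, "major": -5.0, "critical": -10.0,
}
SEVERITY_RANK: Dict[str, int] = {"none": 0, "minor": 1, "major": 2, "critical": 3}

# An error span is (char_start, char_end_exclusive, severity).
Span = Tuple[int, int, str]


def token_offsets(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Find each token in `text` left to right and return its character offsets.

    Assumes every token is a literal substring of `text`.
    """
    offsets, cursor = [], 0
    for tok in tokens:
        start = text.index(tok, cursor)
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets


def spans_to_token_rewards(text: str, tokens: List[str], spans: List[Span]) -> List[float]:
    """Give each token the reward of the worst-severity span it overlaps."""
    rewards = []
    for t_start, t_end in token_offsets(text, tokens):
        worst = "none"
        for s_start, s_end, severity in spans:
            overlaps = t_start < s_end and s_start < t_end
            if overlaps and SEVERITY_RANK[severity] > SEVERITY_RANK[worst]:
                worst = severity
        rewards.append(SEVERITY_REWARD[worst])
    return rewards


def gae(rewards: List[float], values: List[float],
        gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation; the value after the last token is 0."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


text = "The cat sat on mat"
spans = [(4, 7, "minor"), (12, 18, "major")]
word_rewards = spans_to_token_rewards(text, text.split(), spans)
subword_rewards = spans_to_token_rewards(text, ["The", "c", "at", "sat", "on", "mat"], spans)
```

Because the mapping works on character offsets, the same spans yield consistent rewards for word-level and subword-level tokenizations: both pieces of "cat" inherit the minor-error reward. Every token receives its own signal, instead of one score being shared by the whole sentence.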

Abstract

Reinforcement learning (RL) has been proven to be an effective and robust method for training neural machine translation systems, especially when paired with powerful reward models that accurately assess translation quality. However, most research has focused on RL methods that use sentence-level feedback, leading to inefficient learning signals due to the reward sparsity problem: the model receives a single score for the entire sentence. To address this, we propose a novel approach that leverages fine-grained, token-level quality assessments along with error severity levels using RL methods. Specifically, we use xCOMET, a state-of-the-art quality estimation system, as our token-level reward model. We conduct experiments on small and large translation datasets with standard encoder-decoder and large language model-based machine translation systems, comparing the impact of sentence-level versus fine-grained reward signals on translation quality. Our results show that training with token-level rewards improves translation quality across language pairs over baselines according to both automatic and human evaluation. Furthermore, token-level reward optimization improves training stability, as evidenced by a steady increase in mean rewards over training epochs.

Paper Structure

This paper contains 42 sections, 10 equations, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: Two examples are presented, both with identical sentence-level assessments but differing error severity and frequency. The reward model identifies translation error spans along with their corresponding severity levels. In these examples, we highlight both minor and major error spans. By mapping these spans to numerical values that reflect their severity, we can derive word-level scores/rewards. Since error spans can contain multiple words, we assume that all words within a given span share the same severity.
  • Figure 2: Sentence-level RL losses.
  • Figure 3: Mean rewards per training step for the IWSLT2017 EN→FR (top) and WMT18 EN→DE (bottom) datasets using xCOMET as the reward model with NLLB. The learning curves highlight training stability trends, where tRL (orange) displays greater stability than sRL (blue). Note that reward scales are not directly comparable due to differences in granularity and clipping methods.
  • Figure 4: COMET22 scores for NLLB (top), Tower (middle), and a comparative analysis of training and test data length distribution (bottom) on WMT24 EN→DE across increasing source sentence lengths, measured by character string length.
  • Figure 5: DA scores for sRL and tRL with Tower on WMT24 EN→DE across increasing source sentence lengths.
  • ...and 1 more figure