GRRM: Group Relative Reward Modeling for Machine Translation

Sen Yang, Shanbo Cheng, Lu Xu, Jianbing Zhang, Shujian Huang

TL;DR

Experimental results demonstrate that the Group Quality Metric (GQM) framework not only improves general translation quality but also unlocks reasoning capabilities comparable to state-of-the-art reasoning models.

Abstract

While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains like Machine Translation hinges on accurate intra-group ranking. We find that standard Scalar Quality Metrics (SQM) fall short in this context: by evaluating candidates in isolation, they lack the comparative context needed to distinguish fine-grained linguistic nuances. To address this, we introduce the Group Quality Metric (GQM) paradigm and instantiate it as the Group Relative Reward Model (GRRM). Unlike traditional independent scorers, GRRM processes the entire candidate group jointly, using comparative analysis to resolve relative quality with adaptive granularity. Empirical evaluations confirm that GRRM achieves competitive ranking accuracy against all baselines. Building on this foundation, we integrate GRRM into the GRPO training loop to optimize the translation policy. Experimental results demonstrate that our framework not only improves general translation quality but also unlocks reasoning capabilities comparable to state-of-the-art reasoning models. We release code, datasets, and model checkpoints at https://github.com/NJUNLP/GRRM.
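To illustrate why intra-group ranking matters for GRPO, the following is a minimal sketch of the commonly used group-relative advantage, in which each candidate's reward is standardized against the mean and standard deviation of its own group. It is not taken from the GRRM codebase: the function name `group_relative_advantages`, the `eps` constant, and both score lists are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one candidate group (GRPO-style advantage).

    `eps` is an assumed small constant that avoids division by zero when
    every candidate in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for one group of four candidate translations.
# An independent scalar scorer that saturates gives every candidate the top score.
sqm_scores = [1.0, 1.0, 1.0, 1.0]
# A group-aware scorer that compares candidates jointly can separate them.
gqm_scores = [0.9, 0.6, 0.8, 0.3]

sqm_adv = group_relative_advantages(sqm_scores)
gqm_adv = group_relative_advantages(gqm_scores)

print("SQM advantages:", sqm_adv)
print("GQM advantages:", gqm_adv)
```

When all rewards in a group are identical, every advantage is zero and the group contributes no learning signal. Under this formulation, a scorer that can separate near-equivalent candidates is what keeps the policy gradient informative.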

Paper Structure (63 sections, 13 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Performance on Seed-X-Challenge. Left: Ranking accuracy across paradigms. Right: Translation performance across General and MT-specialized LLMs.
  • Figure 2: Comparison of SQM (A) and GQM (B) paradigms.
  • Figure 3: Analysis of score saturation across seven benchmarks. We compare the Average Score (left) and Saturation Rate (right) under SQM and GQM paradigms using Gemini-1.5-Pro and DeepSeek-R1. SQM consistently exhibits score inflation and high saturation, whereas GQM effectively mitigates this issue, providing more discriminative evaluation.
  • Figure 4: Reward trends during GRPO training with 30-step moving averages, using SQM-GenRM vs. GRRM as reward providers (scores normalized to $[0,1]$). The SQM-GenRM rewards saturate early whereas GRRM remains in a non-saturated regime and continues to provide discriminative training signal.
  • Figure 5: Comparison of Scalar Quality Metric (SQM) and Group Quality Metric (GQM) across four distinct scenarios. Cases 1 and 2 (top) demonstrate GQM's ability to resolve fine-grained stylistic nuances and identify localization errors that SQM misses due to independent evaluation. Cases 3 and 4 (bottom) illustrate how GQM uses contrastive context to detect hallucinations and semantic omissions that appear fluent in isolation. Cases 1, 3, and 4 were conducted using Gemini-2.5-Pro; Case 2 was conducted using DeepSeek-R1-0528.
  • ...and 1 more figures
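
The quantities plotted in Figures 3 and 4 can be approximated with a short sketch. The trailing moving average below matches the 30-step smoothing named in the Figure 4 caption. The `saturation_rate` helper, however, rests on an assumption: the captions do not define the Saturation Rate, so it is taken here to mean the fraction of scores at the maximum of the scale. Both function names and the sample values are hypothetical.

```python
def moving_average(values, window=30):
    """Trailing moving average; early steps average over the values seen so far."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just left the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def saturation_rate(scores, max_score=1.0, tol=1e-9):
    """ASSUMED definition: share of scores at (or within tol of) the top of the scale."""
    return sum(1 for s in scores if s >= max_score - tol) / len(scores)


# Hypothetical rewards already normalized to [0, 1].
rewards = [1.0, 1.0, 0.5, 0.2]
print(moving_average([1, 2, 3, 4], window=2))
print(saturation_rate(rewards))
```

A reward curve that flattens near 1.0 together with a high saturation rate would correspond to the early-saturation behavior that the captions attribute to SQM-based rewards.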