mR3: Multilingual Rubric-Agnostic Reward Reasoning Models

David Anugraha, Shou-Yi Hung, Zilu Tang, Annie En-Shiun Lee, Derry Tanti Wijaya, Genta Indra Winata

TL;DR

The paper tackles the challenge of evaluating LLMs across languages by introducing mR3, a task-agnostic, rubric-agnostic reward modeling framework trained on 72 languages. It systematically curates a large multilingual dataset, explores multiple prompting and reasoning languages, and employs curriculum-based supervised fine-tuning to build strong multilingual reward models. mR3 achieves state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models such as GPT-OSS-120B while being up to 9x smaller, and demonstrates improved reasoning faithfulness across languages. The work shows that English prompts remain strong, but that targeted training in the target language substantially narrows the gap, which improves interpretability and accessibility for non-English users. The models, data, and code are released as open source to spur further progress in multilingual evaluation.

Abstract

Evaluation using Large Language Model (LLM) judges has been widely adopted in English and shown to be effective for automatic evaluation. However, their performance does not generalize well to non-English settings, and it remains unclear what constitutes effective multilingual training for such judges. In this paper, we introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages, achieving the broadest language coverage in reward modeling to date. We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models, including the integration of target-language reasoning datasets. Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models (i.e., GPT-OSS-120B) while being up to 9x smaller, and its effectiveness is further confirmed through extensive ablation studies. Our models, data, and code are available as open source at https://github.com/rubricreward/mr3.

Paper Structure

This paper contains 63 sections, 5 equations, 3 figures, 20 tables.

Figures (3)

  • Figure 1: The mR3 model supports multilingual input and enables reasoning outputs to be tailored to user preferences. mR3 can process information, perform reasoning, and generate responses across multiple languages.
  • Figure 2: Construction of the mR3 dataset, aligned across multilingual settings to highlight the trade-offs between using English and the input language for prompts and reasoning traces. Here, prompt denotes both the instruction and the rubric, eng denotes English, and tgt denotes the target language, determined by the input. A training sample is accepted if (1) all outputs distilled from gpt-oss-120b under the different prompting and reasoning languages are correct, and (2) gpt-oss-20b does not solve it consistently across five samples.
  • Figure 3: Average performance of the mR3 models (solid bars) and their base models (hatched bars) across parameter sizes and multilingual prompting and reasoning strategies. Each mR3 model consistently improves on its corresponding base model under every strategy, especially when reasoning in the target language.
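
The acceptance rule in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the `Candidate` record, the setting keys such as "eng-eng", and the exact-match comparison against a gold label are all hypothetical. It also assumes that "solves it consistently" means all five gpt-oss-20b samples are correct, which the caption does not spell out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """Hypothetical record for one candidate training sample."""
    gold_label: str
    # Larger-model (gpt-oss-120b) prediction per prompt/reasoning language
    # setting. The keys are illustrative names, not taken from the paper.
    teacher_predictions: Dict[str, str]
    # Repeated predictions from the smaller model (gpt-oss-20b).
    student_samples: List[str]


def teacher_all_correct(c: Candidate) -> bool:
    """Condition (1): every distilled output is correct in every setting."""
    preds = c.teacher_predictions.values()
    return len(c.teacher_predictions) > 0 and all(p == c.gold_label for p in preds)


def student_solves_consistently(c: Candidate, n_samples: int = 5) -> bool:
    """Assumed reading of 'consistently': all n_samples draws are correct."""
    if len(c.student_samples) != n_samples:
        raise ValueError(f"expected {n_samples} samples, got {len(c.student_samples)}")
    return all(s == c.gold_label for s in c.student_samples)


def accept(c: Candidate) -> bool:
    """Keep a sample only if (1) holds and the smaller model is not already reliable."""
    return teacher_all_correct(c) and not student_solves_consistently(c)


if __name__ == "__main__":
    sample = Candidate(
        gold_label="A",
        teacher_predictions={"eng-eng": "A", "tgt-eng": "A", "tgt-tgt": "A"},
        student_samples=["A", "B", "A", "A", "B"],
    )
    print(accept(sample))
```

Under these assumptions the filter keeps samples whose labels the larger model reproduces in every language setting while discarding those the smaller model already answers reliably, so the retained data is both trustworthy and non-trivial.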