Cross-lingual Transfer of Reward Models in Multilingual Alignment

Jiwoo Hong, Noah Lee, Rodrigo Martínez-Castaño, César Rodríguez, James Thorne

TL;DR

This work shows that reward models (RMs) trained on English data transfer strongly to non-English languages in multilingual RLHF, outperforming target-language RMs on Multilingual RewardBench and improving downstream multilingual alignment. The authors find that English RMs best preserve the representations of the base multilingual language model (MLM) and that MLMs encode language-aware representations, which they offer as an explanation for why English data generalizes across languages. They support these findings with analyses of representation preservation and embedding norms, and extend the results to off-the-shelf RMs, both classifier and generative. The results support using English RM data as a practical, cost-efficient strategy for multilingual alignment, and the authors release their code, models, and data.

Abstract

Reinforcement learning with human feedback (RLHF) is shown to largely benefit from precise reward models (RMs). However, recent studies in reward modeling schemes are skewed towards English, limiting the applicability of RLHF in multilingual alignment. In this work, we investigate the cross-lingual transfer of RMs trained in diverse languages, primarily from English. Our experimental results demonstrate the strong cross-lingual transfer of English RMs, which exceed target-language RMs by 3 to 4% on average on Multilingual RewardBench. Furthermore, we analyze the cross-lingual transfer of RMs through representation shifts. Finally, we perform multilingual alignment to exemplify how cross-lingual transfer in RMs propagates to enhanced multilingual instruction-following capability, along with extensive analyses of off-the-shelf RMs. We release the code, model, and data.

Paper Structure

This paper contains 36 sections, 5 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Proportion of the largest singular value in the concatenated hidden states for a fixed context translated into five languages, with RMs trained in each language. While English ("En") best preserves the representation diversity of the base model ("Inst"), Korean ("Ko") leads to the most homogeneous representations.
  • Figure 2: Embedding norm distribution comparison between English and four other languages (2 non-Latin (top), 2 Latin (bottom)) across four language models: OLMo-1B and SmolLM-1.7B (monolingual pre-training) and Qwen2.5-3B and Llama-3.2-3B (multilingual pre-training). While English and non-English token embedding norm distributions of OLMo-1B and SmolLM-1.7B are distinct, they are similar in Qwen2.5-3B and Llama-3.2-3B.
  • Figure 3: Multilingual AlpacaEval results of Qwen2.5-7B-Instruct models fine-tuned with DPO on on-policy generations for four non-English languages. The alignment data were labeled with either the English RM or the target-language RM. Results are averaged over 5 runs.
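
The quantity plotted in Figure 1 can be illustrated with a minimal sketch. It assumes the "proportion of the largest singular value" means the top singular value divided by the sum of all singular values of the row-wise concatenated hidden states; the paper's exact normalization and choice of layer are not given here, and the function name `top_singular_value_proportion` is hypothetical. A value near 1 indicates that the representations across languages have collapsed toward a single direction (more homogeneous), while a lower value indicates more diverse representations.

```python
import numpy as np


def top_singular_value_proportion(hidden_states):
    """Share of the largest singular value among all singular values.

    hidden_states: list of arrays, one per language, each shaped
    (num_tokens, hidden_dim). Assumption: the arrays are concatenated
    along the token axis before the SVD.
    """
    stacked = np.concatenate(hidden_states, axis=0)
    # compute_uv=False returns only the singular values, sorted descending.
    singular_values = np.linalg.svd(stacked, compute_uv=False)
    return float(singular_values[0] / singular_values.sum())


# Identical rows for every language: rank-1 matrix, proportion close to 1.
homogeneous = [np.ones((4, 8)) for _ in range(5)]
# Orthonormal rows: all singular values equal, proportion is 1 / hidden_dim.
diverse = [np.eye(8)]

print(top_singular_value_proportion(homogeneous))
print(top_singular_value_proportion(diverse))
```

In practice the inputs would be hidden states extracted from the base model and from each trained RM for the same translated context, so that the proportions can be compared across training languages.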