Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment
Zhaofeng Wu, Ananth Balashankar, Yoon Kim, Jacob Eisenstein, Ahmad Beirami
TL;DR
This work tackles the scarcity of multilingual preference data by proposing reward-model transfer (RM transfer) for zero-shot cross-lingual alignment. An RM is trained in a source language and then used to guide the alignment of an LM in a target language. The approach is evaluated on summarization and OpenAssistant-style dialog, using both human judgments and multilingual LM judges. Across diverse evaluation settings, RM transfer yields robust gains and in some cases even outperforms a monolingual RM, and it remains effective when target-language SFT data is unavailable. The findings offer practical recommendations on data composition and source-language selection for broadening language coverage at reduced labeling cost.
Abstract
Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, where a reward model is trained on preference data in one source language and directly applied to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation: in the best settings, cross-lingually aligned models are preferred by humans over unaligned models on more than 70% of evaluation instances. Moreover, we find that a different-language reward model sometimes yields better aligned models than a same-language reward model. We also identify best practices for the case where no language-specific data is available even for supervised finetuning, another component of alignment.
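To make the idea of applying a source-language reward model to target-language outputs concrete, here is a minimal sketch in Python. It rests on assumptions not stated in the text above: a Bradley-Terry style pairwise loss as the reward-model training objective, and best-of-n reranking as the way the reward model guides generation. All names (`pairwise_preference_loss`, `best_of_n`, `toy_source_language_rm`) are hypothetical, and the toy scorer merely stands in for a trained multilingual reward model.

```python
import math
from typing import Callable, Sequence


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected)).

    Assumption: this is a common reward-model objective, used here only
    for illustration. Written in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate that the reward function scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda response: reward_fn(prompt, response))


def toy_source_language_rm(prompt: str, response: str) -> float:
    """Hypothetical stand-in for a reward model trained on source-language data.

    It prefers responses of about eight words. A real reward model would be
    a finetuned multilingual LM that outputs a scalar score.
    """
    return -abs(len(response.split()) - 8)


if __name__ == "__main__":
    # Target-language (German) candidates, scored by the source-language scorer.
    prompt = "Fasse den Artikel zusammen."
    candidates = [
        "Kurz.",
        "Die Studie zeigt, dass Belohnungsmodelle zwischen Sprachen übertragbar sind.",
    ]
    print(best_of_n(prompt, candidates, toy_source_language_rm))
    print(pairwise_preference_loss(1.0, 0.0))
```

In this sketch the zero-shot aspect is that the scorer never sees target-language preference labels; only the candidates it ranks are in the target language.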
