Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment

Zhaofeng Wu; Ananth Balashankar; Yoon Kim; Jacob Eisenstein; Ahmad Beirami

TL;DR

This work addresses the scarcity of multilingual preference data by proposing reward model (RM) transfer for zero-shot cross-lingual alignment. An RM is trained in a source language and used to guide the alignment of an LM in a target language; the approach is evaluated on summarization and OpenAssistant-style dialog using human judgments and multilingual LM judges. Across diverse evaluation settings, RM transfer yields robust gains and in some cases even outperforms alignment with a monolingual RM, and it remains effective when target-language SFT data is unavailable. The findings offer practical recommendations for data composition and source-language selection to broaden language coverage at reduced labeling cost.

Abstract

Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, where a reward model is trained on preference data in one source language and directly applied to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation: cross-lingually aligned models are preferred by humans over unaligned models on up to >70% of evaluation instances. We moreover find that a different-language reward model sometimes yields better aligned models than a same-language reward model. We also identify best practices when there is no language-specific data for even supervised finetuning, another component in alignment.
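
The core idea in the abstract, scoring target-language generations with a reward model trained on another language, can be illustrated with best-of-n reranking, one of the reward optimization methods mentioned in the figure captions. The following is a minimal sketch, not the authors' implementation: `best_of_n` and `toy_source_language_rm` are hypothetical names, and the toy scoring rule is only a stand-in for a real RM trained on source-language preference data.

```python
from typing import Callable, Sequence


def best_of_n(
    prompt: str,
    candidates: Sequence[str],
    reward_fn: Callable[[str, str], float],
) -> str:
    """Return the candidate that the reward function scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: reward_fn(prompt, c))


def toy_source_language_rm(prompt: str, response: str) -> float:
    """Hypothetical stand-in for an RM trained on source-language data.

    Assumption for illustration only: reward word overlap with the
    prompt and lightly penalize length. A real RM would be a finetuned
    multilingual model applied unchanged to target-language text.
    """
    prompt_words = set(prompt.lower().split())
    response_words = response.lower().split()
    overlap = sum(word in prompt_words for word in response_words)
    return overlap - 0.1 * len(response_words)


# Target-language (Spanish) prompt and n = 3 sampled candidates.
prompt = "el gato duerme en el sofa"
candidates = [
    "un perro corre",
    "el gato duerme",
    "el gato duerme en el sofa porque esta muy cansado hoy",
]
chosen = best_of_n(prompt, candidates, toy_source_language_rm)
```

In this setup the candidates would come from a target-language SFT model, and only the reward function crosses languages; no target-language preference data is used.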

Paper Structure (39 sections, 1 equation, 11 figures, 25 tables)

Introduction
Background: Alignment From Human Feedback
The SFT stage
The RM stage
The reward optimization stage
Reward Model Transfer for Cross-Lingual Alignment
Experimental Setup
Summarization.
Open-Ended Dialog Generation.
Evaluation.
Results
Cross-Lingual Alignment Is Effective
Cross-Lingual Alignment Sometimes Outperforms Monolingual Alignment
Cross-Lingual Alignment Without Target-Language SFT Data
Practical Recommendations
...and 24 more sections

Figures (11)

Figure 1: Cross-lingual reward model (RM) transfer. To align in a target language (in this example, Spanish), common monolingual alignment uses a RM for that target language. Instead, we re-purpose a RM for a different source language (in this example, English).
Figure 2: Performing target-language alignment using a RM for a different source language improves performance, when evaluated exclusively in the target language. This improvement is sometimes even larger than using the target-language RM (monolingual alignment). Here we measure the win rate against the target-language (unaligned) SFT model judged by humans, and the 95% confidence interval across validation instances. "source$\to$target" denotes using a source-language RM to drive alignment in the target language.
Figure 3: Cross-lingual alignment effectiveness judged by a finetuned target-language RM evaluator, measured in its score increase between the aligned model and the target-language SFT model. Each group in (a) and subplot in (b) represents one target language, and different dots/lines within each represent different source languages. RL is difficult to train for OpenAssistant (see Experimental Setup), so we omit it here. In most cases, the RM evaluator score improves for cross-lingually aligned models.
Figure 4: Alignment effectiveness, compared to the target-language SFT model judged by PaLM-2-L, and the 95% confidence interval across validation instances. "source$\to$target" denotes a source-language RM driving alignment in the target language. Cross-lingual alignment is generally effective, sometimes outperforming monolingual alignment. RL is hard to train for OpenAssistant, in line with what its authors found (Köpf et al., 2023).
Figure 5: Cross-lingual alignment results without target-language SFT data using various strategies and on different data. Training the SFT model using data translated from another language can be helpful when aligning using RL ((d)), but domain match is important for best-of-$n$ ((c) and the back-translation results).
...and 6 more figures
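
Figures 2 and 4 report a win rate against the SFT model together with a 95% confidence interval across validation instances. The sketch below shows one common way such an interval can be computed, assuming a normal (Wald) approximation to the binomial; the source does not state which interval method the authors used, and `win_rate_with_ci` is a hypothetical helper name.

```python
import math
from typing import Sequence, Tuple


def win_rate_with_ci(
    wins: Sequence[bool], z: float = 1.96
) -> Tuple[float, float, float]:
    """Return (win rate, lower bound, upper bound).

    Assumption: normal-approximation interval with z = 1.96 for 95%
    coverage. Each entry of `wins` records whether the aligned model
    was preferred over the SFT model on one validation instance.
    """
    n = len(wins)
    if n == 0:
        raise ValueError("need at least one comparison")
    rate = sum(wins) / n
    half_width = z * math.sqrt(rate * (1.0 - rate) / n)
    # Clip to [0, 1] because a proportion cannot leave that range.
    return rate, max(0.0, rate - half_width), min(1.0, rate + half_width)


# Illustrative judgments only (not data from the paper): 70 wins in 100.
judgments = [True] * 70 + [False] * 30
rate, lower, upper = win_rate_with_ci(judgments)
```

An interval whose lower bound stays above 0.5 indicates that the aligned model is preferred more often than not under this approximation.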
