Generalizing Reward Modeling for Out-of-Distribution Preference Learning
Chen Jia
TL;DR
This work tackles generalizing reward modeling for out-of-distribution preference learning (OOD PL) in LLM alignment. It introduces a gradient-based bilevel meta-learning framework in which a single reward function is trained to guide policy optimization across multiple task distributions, with KL regularization mitigating policy drift and distribution shift. The outer objective optimizes preference alignment, while the inner objective performs task-specific policy fine-tuning; a convergence bound shows the method approaches a stationary point as the number of outer iterations grows and the reward factor $\beta$ increases. Empirically, the approach achieves state-of-the-art performance on controlled sentiment generation and knowledge answer generation across 20 held-out domains, with improvements in PL accuracy, RM-based rewards, and judgment-based metrics (including GPT-4 judgments). The results support the practical value of a meta-trained, general reward model (RM) for robust OOD PL in real-world alignment tasks.
Abstract
Preference learning (PL) with large language models (LLMs) aims to align the LLMs' generations with human preferences. Previous work on reinforcement learning from human feedback (RLHF) has demonstrated promising results in in-distribution PL. However, because human feedback is difficult to obtain, separately training a reward model for every encountered distribution is challenging. Thus, out-of-distribution (OOD) PL is practically useful for enhancing the generalization ability of LLMs with limited preference feedback. This work addresses OOD PL by optimizing a general reward model through a meta-learning approach. During meta-training, a bilevel optimization algorithm is used to learn a reward model capable of guiding policy learning to align with human preferences across various distributions. When encountering a test distribution, the meta-test procedure conducts regularized policy optimization using the learned reward model for PL. We theoretically establish the convergence rate of the bilevel optimization algorithm under reasonable assumptions. Additionally, we conduct experiments on two text generation tasks across 20 held-out domains and outperform a variety of strong baselines across various evaluation metrics.
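
To make the "regularized policy optimization" step of the meta-test procedure concrete, the following is a minimal sketch on a toy problem with three candidate responses. It assumes the common KL-regularized objective E_pi[r] - beta * KL(pi || pi_ref), whose maximizer over a finite set is proportional to pi_ref * exp(r / beta). This parametrization, the function names, and the toy numbers are illustrative assumptions and need not match the paper's formulation; in particular, `beta` here is the KL penalty weight and may not correspond to the reward factor $\beta$ mentioned above.

```python
import numpy as np

def kl_regularized_policy(ref_probs, rewards, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite response set.

    Assumed textbook form (not taken from the paper): pi(y) ∝ pi_ref(y) * exp(r(y) / beta).
    """
    ref_probs = np.asarray(ref_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    logits = np.log(ref_probs) + rewards / beta
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Toy setup: a reference policy over three responses and hypothetical reward-model scores.
ref = np.array([0.5, 0.3, 0.2])
r = np.array([0.0, 1.0, 2.0])

tight = kl_regularized_policy(ref, r, beta=10.0)  # strong regularization: stays near ref
loose = kl_regularized_policy(ref, r, beta=0.1)   # weak regularization: chases reward

print("tight:", tight, "KL to ref:", kl_divergence(tight, ref))
print("loose:", loose, "KL to ref:", kl_divergence(loose, ref))
```

A larger penalty weight keeps the optimized policy close to the reference policy, while a smaller one lets it concentrate on the highest-reward response; this is the trade-off that the regularization is meant to control under distribution shift.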
