Everyone Deserves A Reward: Learning Customized Human Preferences

Pengyu Cheng; Jiawen Xie; Ke Bai; Yong Dai; Nan Du

TL;DR

The paper addresses learning customized human preferences for LLMs by introducing a Domain-Specific Preference (DSP) dataset and a three-stage training scheme: Base LM Training, General RM Fine-tuning, and Customized RM Fine-tuning (CRFT). It evaluates data strategies and imitation-learning variants across these stages and finds that enriching the general-preference data and applying imitation learning during CRFT help preserve general preference ability while fitting domain-specific preferences. Experiments across multiple base models and preference datasets show how general and customized preferences can be balanced in a data-efficient way. The DSP dataset and the methodological findings offer a path toward domain-aware alignment in real-world, privacy-conscious applications.

Abstract

Reward models (RMs) are essential for aligning large language models (LLMs) with human preferences to improve interaction quality. However, the real world is pluralistic, which leads to diverse human preferences across religions, politics, cultures, and so on. Moreover, each individual can have unique preferences on various topics. Current human-feedback alignment methods neglect this diversity and consider only a general reward model, which is unsatisfactory for customized or personalized application scenarios. To explore customized preference learning, we collect a domain-specific preference (DSP) dataset, which includes preferred responses for each given query from four practical domains. In addition, from the perspective of data efficiency, we propose a three-stage customized RM learning scheme and empirically verify its effectiveness on both general preference datasets and our DSP set. Furthermore, we test multiple training and data strategies on the three learning stages. We find several ways to better preserve general preference ability while training customized RMs, in particular general preference enrichment and customized preference imitation learning. The DSP dataset and code are available at https://github.com/Linear95/DSP.
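The pairing rule described in the Figure 3 caption (for a given domain, its own response to a query is preferred over the other domains' responses to the same query) can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the function name `build_domain_pairs`, the dictionary layout, and the domain labels other than "Academy" are assumptions made for the example.

```python
def build_domain_pairs(responses, target_domain):
    """Turn one query's per-domain responses into preference pairs.

    responses: dict mapping a domain name to that domain's response text
               for a single query (hypothetical layout).
    target_domain: the domain whose response is treated as preferred.
    """
    chosen = responses[target_domain]  # raises KeyError for an unknown domain
    return [
        {"chosen": chosen, "rejected": text, "rejected_domain": domain}
        for domain, text in responses.items()
        if domain != target_domain
    ]


# Placeholder domain names: only "Academy" appears in the text above.
example = {
    "Academy": "A response written from an academic point of view.",
    "domain_b": "A response written for a second domain.",
    "domain_c": "A response written for a third domain.",
    "domain_d": "A response written for a fourth domain.",
}
pairs = build_domain_pairs(example, "Academy")
```

With four domains, each query yields three (chosen, rejected) pairs per target domain, which is the form a pairwise reward model consumes.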

Paper Structure (23 sections, 4 equations, 22 figures, 7 tables)

Introduction
Preliminary
Domain-specific Preference Dataset
Learning Customized Human Preferences
Experimental Details
Base Model Selection
Sample Sizes Comparison on General Fine-tuning
Imitation Learning on General Fine-tuning
Imitation Learning on Customized Fine-tuning
Without General Fine-tuning
Conclusion
Data Collection Details
Additional Results of General RM Fine-tuning
Pooling strategy comparison
Padding and Truncation Strategy Comparison
...and 8 more sections

Figures (22)

Figure 1: We propose a 3-stage training scheme for customized reward models.
Figure 2: Reward model structures. A pretrained large language model (LLM) is utilized as the base model. The input sequence of the reward model includes the input prompt and output response as well as the beginning/end of sentence tokens ([BOS]/[EOS]). The output hidden states of LLM are aggregated into a reward embedding, then the following reward head predicts a reward score. Besides, the LLM hidden states can be additionally trained to imitate the preferred response with a language modeling head providing the next-token prediction.
Figure 3: Data collection for domain-specific preferences. Using crafted system prompts (shown in the paper's system-prompt code listing), we let ChatGPT act as an experienced practitioner in each domain and answer each user query with a domain-preferred response. For a particular domain (e.g., Academy), its own response (solid gray arrow) is supposed to be preferred over the other domains' responses (dotted gray arrows) to the same question.
Figure 4: Clouds of words with top-100 TF-IDF scores in the four domains. The common words with top-100 frequency and stop words are excluded.
Figure 5: Testing performance of customized RM fine-tuning for LLM base comparison. The left-hand-side plot shows the accuracy gains on the H&H set.
...and 17 more figures
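The reward model structure in the Figure 2 caption (hidden states pooled into a reward embedding, a reward head producing a scalar score, and an optional language modeling head that imitates the preferred response) can be sketched in NumPy as below. This is a sketch under stated assumptions, not the paper's implementation: the pairwise loss is the commonly used negative log-sigmoid of the score difference, the pooling options and the weighting coefficient `mu` are illustrative choices, and all function names are hypothetical.

```python
import numpy as np


def reward_score(hidden_states, w, b=0.0, pooling="last"):
    """Pool LLM hidden states (seq_len, hidden_dim) and apply a linear reward head."""
    if pooling == "last":
        embedding = hidden_states[-1]            # hidden state of the final token
    elif pooling == "mean":
        embedding = hidden_states.mean(axis=0)   # average over all positions
    else:
        raise ValueError(f"unknown pooling: {pooling}")
    return float(embedding @ w + b)


def ranking_loss(r_chosen, r_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    return float(np.logaddexp(0.0, -(r_chosen - r_rejected)))


def imitation_loss(logits, target_ids):
    """Mean next-token cross-entropy on the preferred response.

    logits: (seq_len, vocab_size) from the language modeling head.
    target_ids: (seq_len,) integer ids of the tokens to predict.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(target_ids)), target_ids]
    return float(-picked.mean())


def total_loss(r_chosen, r_rejected, logits, target_ids, mu=1.0):
    """Ranking loss plus mu times the imitation loss (mu is an assumed weight)."""
    return ranking_loss(r_chosen, r_rejected) + mu * imitation_loss(logits, target_ids)
```

Setting `mu=0.0` recovers a plain pairwise reward model, which makes it easy to compare training with and without the imitation term.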
