Prototypical Reward Network for Data-Efficient RLHF

Jinghan Zhang; Xiting Wang; Yiqiao Jin; Changyu Chen; Xinhao Zhang; Kunpeng Liu

TL;DR

This paper tackles the data efficiency challenge in RLHF by introducing Proto-RM, a reward-model framework that uses prototypical networks to learn from limited human feedback. By organizing embeddings into two prototype classes (chosen vs. rejected) and employing Infinite Mixture Prototypes with proximity-based updates and dropout-driven diversification, Proto-RM improves reward estimation and subsequent LLM fine-tuning with far less data. Across multiple datasets and ablations, Proto-RM demonstrates higher reward-model accuracy and better RLHF outcomes than baselines, including improved alignment with human preferences as measured by both automatic and human evaluations. The approach significantly reduces data requirements while preserving, and often enhancing, language quality and alignment, suggesting practical benefits for scalable, data-constrained RLHF deployment in LLMs.
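To make the two-prototype idea concrete, here is a minimal sketch of the standard prototypical-network classification step: each class prototype is the mean of that class's embeddings, and a query is scored by a softmax over negative squared distances to the prototypes. This is not the paper's implementation. It omits Infinite Mixture Prototypes, the proximity-based updates, and dropout, and the helper names and toy 2-D embeddings are made up for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_prototypes(embeddings, labels):
    """Return one prototype per label: the mean of that label's embeddings."""
    grouped = {}
    for embedding, label in zip(embeddings, labels):
        grouped.setdefault(label, []).append(embedding)
    return {label: mean_vector(group) for label, group in grouped.items()}


def class_probabilities(query, prototypes):
    """Softmax over negative squared distances from the query to each prototype."""
    logits = {
        label: -squared_distance(query, prototype)
        for label, prototype in prototypes.items()
    }
    top = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - top) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Hypothetical toy embeddings; a real reward model would produce these.
embeddings = [[1.0, 0.0], [0.8, 0.2], [-1.0, 0.0], [-0.8, -0.2]]
labels = ["chosen", "chosen", "rejected", "rejected"]

prototypes = build_prototypes(embeddings, labels)
probs = class_probabilities([0.7, 0.1], prototypes)
print(prototypes)
print(probs)
```

In this toy setup the query lies close to the "chosen" prototype, so most of the probability mass falls on that class. With few labeled pairs, averaging into prototypes is one way to get a more stable class representation than any single example provides.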

Abstract

The reward model for Reinforcement Learning from Human Feedback (RLHF) has proven effective in fine-tuning Large Language Models (LLMs). However, collecting human feedback for RLHF can be resource-intensive and lead to scalability issues for LLMs and complex tasks. Our proposed framework, Proto-RM, leverages prototypical networks to enhance reward models under limited human feedback. By enabling stable and reliable structural learning from fewer samples, Proto-RM significantly enhances LLMs' adaptability and accuracy in interpreting human preferences. Extensive experiments on various datasets demonstrate that, in data-limited scenarios, Proto-RM significantly improves the performance of reward models and LLMs in human feedback tasks, achieving comparable and usually better results than traditional methods while requiring significantly less data. This research offers a promising direction for enhancing the efficiency of reward models and optimizing the fine-tuning of language models under restricted feedback conditions.

Paper Structure (37 sections, 12 equations, 6 figures, 8 tables, 1 algorithm)

Introduction
This Work.
Contributions.
Related Work
Reinforcement Learning from Human Feedback (RLHF)
Prototypical Networks
Problem Formulation
Input.
Output.
Methodology
Reward Model with Prototypical Network
Reward Model for RLHF.
Prototypical Network.
Reward Model with Prototypical Network
Prototype Initialization.
...and 22 more sections

Figures (6)

Figure 1: Our proposed Proto-RM framework enhances the reward model with prototypical networks. Left: Humans annotate pairwise RLHF data and select their preferred text. Middle: Proto-RM aggregates similar examples from the embedding space into prototypes. Right: The enhanced reward model fine-tunes a pretrained LLM.
Figure 2: The framework consists of three components: 1) reward model embedding, 2) Proto-RM adjustment, and 3) the RLHF process. The reward model compresses and aligns the sample text pair embeddings to produce representative prototypes, and the prototypes adjust the embeddings to update the reward model.
Figure 3: Comparison of reward models' accuracy on 5%, 10%, and 20% datasets.
Figure 4: Performance of LLM with reward model fine-tuning.
Figure 5: Impacts of Dropout. Models incorporating dropout exhibit higher accuracy, with Cosine Similarity Dropout performing slightly better than Random Dropout.
...and 1 more figures
