Prototypical Reward Network for Data-Efficient RLHF
Jinghan Zhang, Xiting Wang, Yiqiao Jin, Changyu Chen, Xinhao Zhang, Kunpeng Liu
TL;DR
This paper tackles the data efficiency challenge in RLHF by introducing Proto-RM, a reward-model framework that uses prototypical networks to learn from limited human feedback. By organizing embeddings into two prototype classes (chosen vs. rejected) and employing Infinite Mixture Prototypes with proximity-based updates and dropout-driven diversification, Proto-RM improves reward estimation and subsequent LLM fine-tuning with far less data. Across multiple datasets and ablations, Proto-RM demonstrates higher reward-model accuracy and better RLHF outcomes than baselines, including improved alignment with human preferences as measured by both automatic and human evaluations. The approach significantly reduces data requirements while preserving, and often enhancing, language quality and alignment, suggesting practical benefits for scalable, data-constrained RLHF deployment in LLMs.
Abstract
The reward model for Reinforcement Learning from Human Feedback (RLHF) has proven effective in fine-tuning Large Language Models (LLMs). However, collecting human feedback for RLHF can be resource-intensive and can lead to scalability issues for LLMs and complex tasks. Our proposed framework, Proto-RM, leverages prototypical networks to enhance reward models under limited human feedback. By enabling stable and reliable structural learning from fewer samples, Proto-RM significantly enhances LLMs' adaptability and accuracy in interpreting human preferences. Extensive experiments on various datasets demonstrate that Proto-RM significantly improves the performance of reward models and LLMs in human feedback tasks, achieving comparable and usually better results than traditional methods in data-limited scenarios while requiring significantly less data. This research offers a promising direction for enhancing the efficiency of reward models and optimizing the fine-tuning of language models under restricted feedback conditions.
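To make the prototype idea concrete, the following is a minimal sketch of generic prototypical-network classification over two classes (chosen vs. rejected). It is not the authors' implementation: it assumes a single prototype per class computed as the mean of that class's embeddings, scores a query by a softmax over negative squared Euclidean distances, and leaves out Infinite Mixture Prototypes, proximity-based updates, and dropout entirely. The embeddings, the helper names (`mean_vector`, `sq_dist`, `class_probabilities`), and the toy data are all hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_probabilities(embedding, prototypes):
    """Softmax over negative squared distances to each class prototype."""
    logits = {label: -sq_dist(embedding, proto) for label, proto in prototypes.items()}
    largest = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - largest) for label, value in logits.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


# Toy 2-D embeddings, invented purely for illustration.
support = {
    "chosen": [[1.0, 1.0], [1.2, 0.8]],
    "rejected": [[-1.0, -1.0], [-0.8, -1.2]],
}

# Assumption: one prototype per class, the mean of its support embeddings.
prototypes = {label: mean_vector(vectors) for label, vectors in support.items()}

# A query embedding near the "chosen" examples gets a higher "chosen" probability.
probs = class_probabilities([0.9, 1.1], prototypes)
print(probs)
```

Because each class is summarized by a prototype rather than by every labeled example, a handful of annotated samples per class is enough to place a new embedding, which is the intuition behind applying prototypical networks when human feedback is scarce.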
