PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation
Yongfu Xue
TL;DR
PIRA addresses two core RLHF challenges: low data efficiency in reward-model training and reward overoptimization. It reformulates QA data into explicit preference-oriented instructions and introduces dual aggregation, across instruction sets and across stochastic value-head realizations, to stabilize rewards and reduce bias. The approach improves alignment and robustness across multiple models and datasets, shows better data efficiency and cross-domain generalization, and mitigates reward hacking during PPO fine-tuning. These results suggest PIRA as a practical, scalable framework for more reliable preference-aligned LLM training in real-world settings.
Abstract
Reward models are crucial for aligning Large Language Models (LLMs) with human preferences, but they face two representative challenges. First, traditional discriminative reward models usually concatenate questions and responses directly as input, resulting in low data efficiency. Second, reward models are vulnerable to reward overoptimization. We propose PIRA, a training paradigm that addresses these issues through three strategies: (1) reformulating question-answer pairs into preference-based instructions for clearer and more explicit task specification, (2) aggregating rewards from diverse preference tasks to reduce bias and improve robustness, and (3) averaging value-head outputs under varying dropout rates to stabilize rewards. Extensive experiments demonstrate the effectiveness of PIRA.
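The three strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the instruction templates, the toy encoder standing in for an LLM backbone, the placement of dropout on the value-head input, and the use of a plain mean for aggregation across instructions are all assumptions made for illustration, and every function name below is hypothetical.

```python
import random
from statistics import fmean

# Hypothetical preference-oriented instruction templates (not from the paper).
TEMPLATES = [
    "Question: {q}\nResponse: {a}\nIs this response helpful? Rate it.",
    "Question: {q}\nResponse: {a}\nIs this response harmless? Rate it.",
]


def build_prompts(question, answer, templates=TEMPLATES):
    """Strategy 1: reformulate one QA pair into several explicit instructions."""
    return [t.format(q=question, a=answer) for t in templates]


def toy_encoder(prompt, dim=8):
    """Stand-in for an LLM backbone: deterministic pseudo-features per prompt."""
    rng = random.Random(prompt)  # string seeds are hashed deterministically
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def value_head(features, weights, dropout_rate, rng):
    """Linear value head with inverted dropout on its inputs (rate must be < 1)."""
    keep = 1.0 - dropout_rate
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:          # keep this feature with probability `keep`
            total += (x / keep) * w      # rescale so the expectation is unchanged
    return total


def pira_reward(question, answer, weights, dropout_rates=(0.0, 0.1, 0.2), seed=0):
    """Dual aggregation: mean over dropout rates, then mean over instructions."""
    rng = random.Random(seed)
    per_instruction = []
    for prompt in build_prompts(question, answer):
        feats = toy_encoder(prompt, dim=len(weights))
        # Strategy 3: average value-head outputs under varying dropout rates.
        outputs = [value_head(feats, weights, p, rng) for p in dropout_rates]
        per_instruction.append(fmean(outputs))
    # Strategy 2: aggregate rewards from the different preference instructions.
    return fmean(per_instruction)
```

In this sketch, calling `pira_reward("What is 2+2?", "4", weights=[0.5] * 8)` returns a single scalar reward. With `dropout_rates=(0.0,)` the inner average is a no-op and the result reduces to the mean of the per-instruction linear scores, which makes the contribution of each aggregation level easy to inspect separately.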
