Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data
Shuai Zhao, Yunqiu Xu, Linchao Zhu, Yi Yang
TL;DR
This work addresses the resource-intensive nature of collecting binary human preferences for LLM alignment by introducing RefAlign, a REINFORCE-style algorithm that uses similarity between generated responses and unary reference answers as a surrogate reward. RefAlign removes the need for reward models and binary preference data and can be extended to safety and confidence alignment by incorporating task-specific reward components. Across safety, confidence, and general preference settings, RefAlign achieves performance comparable to traditional reward-model-based methods and can leverage powerful LM references when human references are unavailable. This approach highlights a practical path toward simpler, more scalable preference optimization and supports effective preference distillation from large models.
Abstract
Large language models (LLMs) are expected to be helpful, harmless, and honest. In different alignment scenarios, such as safety, confidence, and general preference alignment, binary preference data collection and reward modeling are resource-intensive but play a central role in transferring human preferences. In this work, we explore using the similarity between sampled generations and reference answers as a supplementary reward function for alignment. When unary reference answers are available, such similarity-based rewards can circumvent the need for binary preference data and explicit reward modeling. We introduce RefAlign, a versatile REINFORCE-style alignment algorithm that does not rely on reward or reference models. RefAlign utilizes language generation evaluation metrics, such as BERTScore, between sampled generations and reference answers as surrogate rewards. Beyond general preference optimization, RefAlign can be naturally extended to diverse scenarios, including safety and confidence alignment, by combining similarity-based rewards with task-specific objectives. Across multiple scenarios, RefAlign achieves performance comparable to prior alignment methods while operating without binary preference data or reward models. The code is available at https://github.com/mzhaoshuai/RefAlign.
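To make the idea of a similarity-based surrogate reward concrete, the following is a minimal sketch and not the authors' implementation. It makes several assumptions: a simple token-overlap F1 (`token_f1`, a hypothetical helper) stands in for BERTScore so that the example runs without any model download, the baseline is the mean reward over the sampled batch, and the sequence log-probabilities are hard-coded floats, whereas in training they would be differentiable outputs of the policy model.

```python
from collections import Counter
from typing import List


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings.

    Assumption: a lightweight stand-in for BERTScore, used only for illustration.
    """
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def reinforce_loss(log_probs: List[float], rewards: List[float]) -> float:
    """REINFORCE-style surrogate loss: -mean((r_i - mean(r)) * log p_i).

    Assumption: the batch-mean reward is used as the baseline.
    """
    baseline = sum(rewards) / len(rewards)
    weighted = [(r - baseline) * lp for r, lp in zip(rewards, log_probs)]
    return -sum(weighted) / len(rewards)


reference = "Paris is the capital of France"
samples = ["The capital of France is Paris", "I am not sure"]
log_probs = [-2.0, -1.0]  # hypothetical sequence log-probabilities under the policy

# Reward each sampled generation by its similarity to the reference answer.
rewards = [token_f1(s, reference) for s in samples]
loss = reinforce_loss(log_probs, rewards)
print(rewards, loss)
```

Minimizing this loss raises the log-probability of samples that score above the batch average and lowers it for samples that score below, with no reward model and no binary preference pair involved. Swapping `token_f1` for an embedding-based metric such as BERTScore changes only how `rewards` is computed.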
