HAF-RM: A Hybrid Alignment Framework for Reward Model Training

Shujun Liu, Xiaoyu Shen, Yuhang Lai, Siyuan Wang, Shengbin Yue, Zengfeng Huang, Xuanjing Huang, Zhongyu Wei

TL;DR

This work introduces HaF-RM, a Hybrid Alignment Framework for reward-model training. The reward layer and the policy layer share one internal preference model, and training adds a token-level policy loss to the standard sequence-level reward loss. The two terms are combined in a hybrid loss $\mathcal{L}_H = \mathbb{E}_d[ D_1(r(d), r^*(d)) + \alpha\cdot D_2(\pi(d), \pi^*(d)) ]$, where $\alpha$ weights the policy term; the stated aim is better calibration and alignment of reward models. On five public datasets and multiple backbones, HaF outperforms the baseline and DPO reward models in intrinsic reward evaluation and in downstream tasks such as Best-of-N and RLHF, and it generalizes better to out-of-distribution data. The results indicate that using the policy loss as a regularizer can stabilize representations and improve performance across language-model backbones, with practical implications for RLHF pipelines and data construction for LLM alignment. The authors release their code.
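The hybrid loss above can be illustrated with a minimal sketch. The summary does not specify $D_1$ and $D_2$, so this example assumes a Bradley-Terry pairwise loss for the reward term and a DPO-style loss for the policy term. The function names, the dictionary keys, the reference log-probabilities, and the default `alpha` and `beta` values are all hypothetical and not taken from the authors' code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_loss(r_chosen: float, r_rejected: float) -> float:
    # Assumed D_1: Bradley-Terry pairwise loss on sequence-level rewards.
    return -log_sigmoid(r_chosen - r_rejected)


def policy_loss(logp_chosen: float, logp_rejected: float,
                ref_logp_chosen: float, ref_logp_rejected: float,
                beta: float = 0.1) -> float:
    # Assumed D_2: DPO-style loss on summed token log-probabilities,
    # measured relative to a frozen reference policy.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -log_sigmoid(beta * margin)


def hybrid_loss(batch: list, alpha: float = 1.0, beta: float = 0.1) -> float:
    # Batch mean of D_1 + alpha * D_2, mirroring the expectation over d.
    total = 0.0
    for ex in batch:
        total += reward_loss(ex["r_chosen"], ex["r_rejected"])
        total += alpha * policy_loss(
            ex["logp_chosen"], ex["logp_rejected"],
            ex["ref_logp_chosen"], ex["ref_logp_rejected"], beta,
        )
    return total / len(batch)
```

Under these assumptions, a pair with zero reward margin and zero policy margin contributes $\log 2$ from each term, so the loss is $(1 + \alpha)\log 2$. Setting `alpha=0.0` recovers the reward-only objective.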

Abstract

The reward model has become increasingly important in alignment, assessment, and data construction for large language models (LLMs). Most existing research focuses on enhancing reward models through data improvements, following the conventional training framework that directly optimizes the predicted rewards. In this paper, we propose HaF-RM, a hybrid alignment framework for reward model training that adds a constraint on token-level policy probabilities alongside the reward score. It simultaneously supervises the internal preference model at the token level and optimizes the mapping layer of the reward model at the sequence level. Experimental results on five datasets demonstrate the validity and effectiveness of the proposed hybrid framework for training a high-quality reward model. By decoupling the reward modeling procedure and incorporating hybrid supervision, HaF-RM offers a principled and effective approach to enhancing the performance and alignment of reward models, a critical component in the responsible development of powerful language models. We release our code at https://haf-rm.github.io.

Paper Structure

This paper contains 46 sections, 20 equations, 11 figures, and 11 tables.

Figures (11)

  • Figure 1: HaF model structure. It retains the policy layer which outputs the token-level probability.
  • Figure 2: HaF training framework. We add the reward layer to the language model while retaining its policy layer. During training, we optimize both the token-level rewards and sequence-level rewards for the input triplets by maximizing the reward differences between better responses and worse responses.
  • Figure 3: HaF tends to assign higher scores to the responses it generates. The x-axis represents the score difference between the ideal reward model's evaluation of the content generated by HaF's policy head and the content generated by the model trained with DPO. The y-axis indicates the score difference when HaF evaluates these two outputs. Different colors represent different model checkpoint selection strategies.
  • Figure 4: The performance differences of HaF / baseline / DPO under mixed preference training, with light shading indicating the upper bound performance of individually trained reward models on each dataset.
  • Figure 5: Average win rates of responses selected by the HaF reward model, the baseline model, and the DPO reward model. Circles may overlap when different models select the same response.
  • ...and 6 more figures