Act-Adaptive Margin: Dynamically Calibrating Reward Models for Subjective Ambiguity

Feiteng Fang, Dingwei Chen, Xiang Huang, Ting-En Lin, Yuchuan Wu, Xiong Liu, Xinge Ye, Ziqiang Liu, Haonan Zhang, Liang Zhu, Hamid Alinejad-Rokny, Min Yang, Yongbin Li

TL;DR

This paper tackles reward modeling for subjective tasks, where Bradley-Terry (BT) models struggle with ambiguous preferences. It introduces Act-Adaptive Margin (AAM), which dynamically calibrates preference margins using the reward model's own internal parameter knowledge. AAM has two implementations: Probability-Ratio Adaptive Margin (PR) and Loss-Difference Adaptive Margin (LD). PR forms an adaptive margin from log-likelihood ratios between the current policy and a reference, while LD uses generation-probability-guided margins that align with SFT losses; both fall under the unified AAM framework. Empirical results on RewardBench, JudgeBench, and the role-playing benchmarks Charm and CharacterEval show substantial improvements over BT and GPT-Margin, and state-of-the-art performance in downstream alignment tasks when combined with GRPO. The approach is practical and annotation-free, and the authors release Charm and related benchmarks for further study.

Abstract

Currently, most reinforcement learning tasks focus on domains like mathematics and programming, where verification is relatively straightforward. However, in subjective tasks such as role-playing, alignment techniques struggle to make progress, primarily because subjective reward modeling using the Bradley-Terry model faces significant challenges when dealing with ambiguous preferences. To improve reward modeling in subjective tasks, this paper proposes AAM (Act-Adaptive Margin), which enhances reward modeling by dynamically calibrating preference margins using the model's internal parameter knowledge. We design two versions of AAM that efficiently generate contextually-appropriate preference gaps without additional human annotation. This approach fundamentally improves how reward models handle subjective rewards by better integrating generative understanding with preference scoring. To validate AAM's effectiveness in subjective reward modeling, we conduct evaluations on RewardBench, JudgeBench, and challenging role-playing tasks. Results show that AAM significantly improves subjective reward modeling performance, enhancing Bradley-Terry reward models by 2.95% in general tasks and 4.85% in subjective role-playing tasks. Furthermore, reward models trained with AAM can help downstream alignment tasks achieve better results. Our test results show that applying rewards generated by AAM-Augmented RM to preference learning techniques (e.g., GRPO) achieves state-of-the-art results on CharacterEval and Charm. Code and dataset are available at https://github.com/calubkk/AAM.
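
To make the idea of a preference margin concrete, here is a minimal Python sketch of a Bradley-Terry pairwise loss with a margin term, in the commonly used form -log sigmoid(r_chosen - r_rejected - margin). It is an illustration under stated assumptions, not the paper's implementation: the function name `bt_margin_loss` and its arguments are hypothetical, and the margin is passed in as a plain number because this summary does not give the exact PR or LD formulas AAM uses to compute it.

```python
import math


def bt_margin_loss(r_chosen: float, r_rejected: float, margin: float = 0.0) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected - margin)).

    Hypothetical helper for illustration only. With margin = 0 this is the
    plain Bradley-Terry loss; a positive margin asks the reward gap to
    exceed that margin before the loss becomes small.
    """
    z = r_chosen - r_rejected - margin
    # -log(sigmoid(z)) equals log(1 + exp(-z)); this form avoids overflow
    # for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Same reward gap, different margins (margin values here are made up).
plain = bt_margin_loss(1.0, 0.0, margin=0.0)
clear_pair = bt_margin_loss(1.0, 0.0, margin=1.0)   # large margin for a clear-cut pair
ambiguous = bt_margin_loss(1.0, 0.0, margin=0.1)    # small margin for an ambiguous pair
print(plain, ambiguous, clear_pair)
```

For a fixed reward gap the loss grows with the margin, so a per-example margin lets training demand a wide gap on clear-cut pairs and only a narrow one on ambiguous pairs. How AAM derives that per-example margin from the model's own probabilities is described in the paper itself.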

Paper Structure

This paper contains 19 sections, 10 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: An example from a role-playing task illustrating the difficulty in obtaining reward signals for subjective abstract tasks: Three LLMs extend a "Naruto" dialogue between Sasuke and Orochimaru, each with varying responses, making reward signal assessment difficult.
  • Figure 2: An overview of the AAM method, along with the construction process of Charm.
  • Figure 3: The character distribution in RoleplayPref consists of 3 primary categories and 13 subcategories.
  • Figure 4: Human evaluation results comparing AAM-GRPO-32B with Claude 3.5 Sonnet, GPT-4o, and Doubao-Pro-Character.
  • Figure 5: The distinction between subjective and objective tasks, as well as the derivation process of the AAM formula.
  • ...and 3 more figures