Adaptive Preference Scaling for Reinforcement Learning with Human Feedback

Ilgee Hong, Zichong Li, Alexander Bukharin, Yixiao Li, Haoming Jiang, Tianbao Yang, Tuo Zhao

TL;DR

This paper addresses the rigidity of standard RLHF reward modeling, where a Bradley–Terry cross-entropy loss imposes a linear scaling between reward differences and preference logits. It introduces Adaptive Preference Scaling (APS), a distributionally robust optimization–based loss that assigns an instance-specific scaling parameter $\tau$ to each pair of trajectory segments, yielding a strictly convex, univariate optimization that flexibly shapes rewards and can be extended to direct preference optimization (Ada-DPO) and quadratic-regularization variants. Empirically, APS improves policy performance and aligns reward learning with downstream optimization in robotic control and large language model tasks, while reducing hyperparameter tuning burdens. The framework provides a principled, scalable approach to handle varying strength in human preferences, with practical implications for robust RLHF deployment.

Abstract

Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values by learning rewards from human preference data. Due to various reasons, however, such data typically takes the form of rankings over pairs of trajectory segments, which fails to capture the varying strengths of preferences across different pairs. In this paper, we propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO), designed to address this uncertainty in preference strength. By incorporating an adaptive scaling parameter into the loss for each pair, our method increases the flexibility of the reward function. Specifically, it assigns small scaling parameters to pairs with ambiguous preferences, leading to more comparable rewards, and large scaling parameters to those with clear preferences for more distinct rewards. Computationally, our proposed loss function is strictly convex and univariate with respect to each scaling parameter, enabling its efficient optimization through a simple second-order algorithm. Our method is versatile and can be readily adapted to various preference optimization frameworks, including direct preference optimization (DPO). Our experiments with robotic control and natural language generation with large language models (LLMs) show that our method not only improves policy performance but also aligns reward function selection more closely with policy optimization, simplifying the hyperparameter tuning process.
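
The computational claim in the abstract (a loss that is strictly convex and univariate in each pair's scaling parameter, so a simple second-order method can optimize it) can be illustrated with a minimal sketch. This excerpt does not state the loss formula, so the form used below, $-\tau \log \sigma(d/\tau) + \rho\tau$ with reward difference $d$, is an assumption chosen only for illustration and may differ from the paper's definition. The names `fit_tau`, `rho`, `tau_min`, and `tau_max` are hypothetical, and the bisection safeguard around the Newton step is an addition for numerical robustness, not something the source describes.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    # Stable logistic function via tanh.
    return 0.5 * (1.0 + math.tanh(0.5 * x))

def scaled_loss(tau, d, rho):
    # ASSUMED illustrative form: -tau * log(sigmoid(d / tau)) + rho * tau.
    return tau * softplus(-d / tau) + rho * tau

def d_loss(tau, d, rho):
    # First derivative of scaled_loss with respect to tau.
    u = d / tau
    return softplus(-u) + u * sigmoid(-u) + rho

def d2_loss(tau, d):
    # Second derivative with respect to tau; positive whenever d != 0.
    u = d / tau
    return d * d * sigmoid(u) * sigmoid(-u) / tau ** 3

def fit_tau(d, rho=-0.5, tau_min=0.1, tau_max=5.0, iters=50, tol=1e-10):
    """Minimize scaled_loss over tau in [tau_min, tau_max] for one pair."""
    lo, hi = tau_min, tau_max
    # The derivative is non-decreasing in tau, so check the boundaries first.
    if d_loss(lo, d, rho) >= 0.0:
        return lo
    if d_loss(hi, d, rho) <= 0.0:
        return hi
    tau = 0.5 * (lo + hi)
    for _ in range(iters):
        g = d_loss(tau, d, rho)
        if abs(g) < tol:
            break
        # Shrink the bracket that contains the minimizer.
        if g > 0.0:
            hi = tau
        else:
            lo = tau
        h = d2_loss(tau, d)
        newton = tau - g / h if h > 0.0 else float("nan")
        # Accept the Newton step only if it stays inside the bracket;
        # otherwise fall back to bisection.
        tau = newton if lo < newton < hi else 0.5 * (lo + hi)
    return tau

# Under this assumed form, larger reward gaps receive larger scaling values.
for d in (0.5, 2.0, 4.0):
    print(d, fit_tau(d))
```

Under this assumed form, the fitted scaling grows with the magnitude of the reward difference, which matches the qualitative behavior the abstract describes: small scaling for ambiguous pairs and large scaling for clear ones.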

Paper Structure (27 sections, 2 theorems, 35 equations, 11 figures, 9 tables, 1 algorithm)

Key Result

Proposition 3.1

Assume we have a pair of trajectory segments $z_1, z_2$ with preference distribution $p(z_1\succ z_2)=p^*\in(0,1)$, i.e., the probability that $z_1$ is preferred over $z_2$ is $p^*$. Consider the problem of minimizing the expectation of our adaptive loss function over the preference distribution [...]. Then the minimizers $\tau^*$ and $r^*$ of the expected loss satisfy [...]. Here, $\sigma^{-1}$ is the inverse [...]

Figures (11)

  • Figure 1: Visualization of the loss function (left) and its gradient (right) on different reward differences.
  • Figure 2: Learning curve plots (top) and percentile plots (bottom) for Pref and Ada-Pref. For the learning curve plots, returns at each timestep are averaged across 10 different seeds, then smoothed over timesteps using an exponential moving average (EMA) with a smoothing factor of $\alpha=0.1$. For the percentile plots, returns from 10 different seeds are sorted in ascending order.
  • Figure 3: The best win rate and the preference prediction accuracy of the corresponding model.
  • Figure 4: The best preference prediction accuracy and the win rate of the corresponding model.
  • Figure 5: Histogram of the learned scaling factors, relationship between preference strength and the learned scaling factors, and relationship between preference strength and the learned reward difference. All plots are from the Ant task.
  • ...and 6 more figures
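
The post-processing described in the Figure 2 caption (averaging returns across seeds, then smoothing with an EMA at $\alpha=0.1$, and sorting per-seed returns for the percentile plots) can be sketched as follows. This is a minimal illustration, not the authors' plotting code. The function names are hypothetical, and the EMA convention used here, where $\alpha$ weights the newest value, is an assumption, since the caption does not say which convention applies.

```python
def mean_across_seeds(returns_by_seed):
    # returns_by_seed: one equal-length list of returns per seed.
    n = len(returns_by_seed)
    return [sum(step) / n for step in zip(*returns_by_seed)]

def ema_smooth(values, alpha=0.1):
    # ASSUMED convention: s_t = alpha * x_t + (1 - alpha) * s_{t-1}, s_0 = x_0.
    smoothed = []
    prev = None
    for v in values:
        prev = v if prev is None else alpha * v + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed

def percentile_curve(final_returns):
    # Per-seed returns sorted in ascending order, as in the percentile plots.
    return sorted(final_returns)

# Toy data with 3 seeds and 4 timesteps (made-up numbers, for shape only).
toy = [[0.0, 1.0, 2.0, 3.0],
       [0.0, 2.0, 4.0, 6.0],
       [0.0, 3.0, 6.0, 9.0]]
curve = ema_smooth(mean_across_seeds(toy), alpha=0.1)
print(curve)
print(percentile_curve([run[-1] for run in toy]))
```

If the opposite EMA convention were intended ($\alpha$ weighting the running average), swapping `alpha` and `1.0 - alpha` in `ema_smooth` would give the much lighter smoothing that results.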

Theorems & Definitions (4)

  • Proposition 3.1
  • Remark 3.1
  • Remark 3.2
  • Proposition 3.2