ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization

YuXuan Zhang

TL;DR

This paper addresses the bottleneck of binary supervision in RLHF by introducing Adaptive Reward-Following (ARF), which extracts continuous satisfaction trajectories from free-form feedback. It couples ARF with TraceBias, a score-based actor-critic fine-tuning method that normalizes and optimizes reward trajectories rather than binary labels, enabling personalized and scalable alignment. Across multiple lightweight LLMs and diverse tasks, ARF matches or surpasses PPO and DPO while reducing annotation costs, and TraceBias demonstrates robustness under synthetic and human supervision. By grounding the approach in linguistic theories of satisfaction and providing a coherent self-supervised RLHF pipeline, the work offers a practical path toward personalized, reliable alignment of large language models.

Abstract

Current RLHF methods such as PPO and DPO typically reduce human preferences to binary labels, which are costly to obtain and too coarse to reflect individual variation. We observe that expressions of satisfaction and dissatisfaction follow stable linguistic patterns across users, indicating that more informative supervisory signals can be extracted from free-form feedback. Building on this insight, we introduce Adaptive Reward-Following (ARF), which converts natural feedback into continuous preference trajectories and optimizes them using the novel TraceBias algorithm. Across diverse LLMs and preference domains, ARF consistently outperforms PPO and DPO, improving alignment by up to 7.6%. Our results demonstrate that continuous reward modeling provides a scalable path toward personalized and theoretically grounded RLHF.

Paper Structure

This paper contains 55 sections, 40 equations, 9 figures, 19 tables, 1 algorithm.

Figures (9)

  • Figure 1: Illustrates the overall workflow of our framework. We begin by deriving posterior satisfaction estimates from natural user feedback via a Static Satisfaction Scorer (Step 1). These samples are then stored and augmented through synonym substitution, truncation, and reweighting to form a diversified reward corpus (Step 2). The ARF scorer is trained with soft labels to predict satisfaction scores and is continuously updated (Step 3). Finally, the TraceBias algorithm leverages ARF-generated rewards to fine-tune the LLM (Step 4), completing a fully self-supervised RLHF pipeline.
  • Figure 2: We compare the gradient norm statistics of PPO, using a clip range $\epsilon = 0.2$ as in the original paper (Schulman et al., 2017), and TraceBias with DAM. DAM exhibits lower variance and more stable gradient magnitudes, suggesting improved training stability and potential for enhanced performance. (V is shown in Appendix \ref{app:V}.)
  • Figure 3: Tracking preference shifts using ARF. Performance drops reflect deliberate adaptation to new negative signals, validating robustness under non-stationary feedback.
  • Figure 4: Average performance comparison under different baselines' fine-tuning. TraceBias consistently outperforms PPO and DPO across tasks. Per-model performance is given in Appendix \ref{app:RLHFBaselines}.
  • Figure 5: V Gradient norm comparison between PPO (with clip range $\epsilon=0.2$) and TraceBias with DAM.
  • ...and 4 more figures
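
The augmentation step in the Figure 1 caption (Step 2: synonym substitution, truncation, and reweighting) could be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: the `SYNONYMS` table, the `augment` function, the `min_keep` parameter, and the weights 1.0, 0.8, and 0.5 are all assumptions made for the example, and the satisfaction score is simply carried over as the soft label for each variant.

```python
import random

# Hypothetical synonym table; the paper's actual lexicon is not given here.
SYNONYMS = {"great": ["excellent", "wonderful"], "bad": ["poor", "awful"]}


def augment(text, score, rng, min_keep=0.6):
    """Return (text, soft_label, weight) variants of one feedback sample.

    `score` is a satisfaction estimate in [0, 1]. The weights are
    illustrative placeholders for the caption's "reweighting" step.
    """
    words = text.split()
    variants = [(text, score, 1.0)]  # original sample, full weight

    # Synonym substitution: swap each known word for a random synonym.
    swapped = [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]
    if swapped != words:
        variants.append((" ".join(swapped), score, 0.8))

    # Truncation: keep a leading fraction of the words (at least one).
    keep = max(1, int(len(words) * rng.uniform(min_keep, 1.0)))
    if keep < len(words):
        variants.append((" ".join(words[:keep]), score, 0.5))

    return variants


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for repeatable runs
    for variant in augment("this answer is great thanks", 0.9, rng):
        print(variant)
```

Keeping the same soft label on every variant is the simplest choice; a scheme that lowers the label or the weight for heavily truncated text would fit the same interface.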