ARF-RLHF: Adaptive Reward-Following for RLHF through Emotion-Driven Self-Supervision and Trace-Biased Dynamic Optimization
YuXuan Zhang
TL;DR
This paper addresses the bottleneck of binary supervision in RLHF by introducing Adaptive Reward-Following (ARF), which extracts continuous satisfaction trajectories from free-form feedback. It couples ARF with TraceBias, a score-based actor-critic fine-tuning method that normalizes and optimizes reward trajectories rather than binary labels, enabling personalized and scalable alignment. Across multiple lightweight LLMs and diverse tasks, ARF matches or surpasses PPO and DPO while reducing annotation costs, and TraceBias demonstrates robustness under synthetic and human supervision. By grounding the approach in linguistic theories of satisfaction and providing a coherent self-supervised RLHF pipeline, the work offers a practical path toward personalized, reliable alignment of large language models.
Abstract
Current RLHF methods such as PPO and DPO typically reduce human preferences to binary labels, which are costly to obtain and too coarse to reflect individual variation. We observe that expressions of satisfaction and dissatisfaction follow stable linguistic patterns across users, indicating that more informative supervisory signals can be extracted from free-form feedback. Building on this insight, we introduce Adaptive Reward-Following (ARF), which converts natural feedback into continuous preference trajectories and optimizes them using the novel TraceBias algorithm. Across diverse LLMs and preference domains, ARF consistently outperforms PPO and DPO, improving alignment by up to 7.6%. Our results demonstrate that continuous reward modeling provides a scalable path toward personalized and theoretically grounded RLHF.
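To make the idea of a continuous preference trajectory concrete, the following is a minimal sketch, not the paper's implementation. It assumes a toy word-list scorer (the names `POSITIVE`, `NEGATIVE`, `satisfaction_score`, and `normalize_trajectory` are hypothetical) in place of whatever learned satisfaction model ARF actually uses, and it uses plain z-score normalization as a stand-in for the normalization step attributed to TraceBias.

```python
from statistics import mean, pstdev

# Hypothetical toy lexicons. ARF's real satisfaction scorer is not specified
# in this section; a learned model would replace this lookup.
POSITIVE = {"great", "thanks", "perfect", "helpful", "good"}
NEGATIVE = {"wrong", "useless", "bad", "confusing", "incorrect"}


def satisfaction_score(feedback: str) -> float:
    """Map one free-form feedback message to a score in [-1, 1]."""
    words = [w.strip(".,!?").lower() for w in feedback.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    # Messages with no sentiment-bearing words are treated as neutral.
    return 0.0 if total == 0 else (pos - neg) / total


def normalize_trajectory(scores: list[float]) -> list[float]:
    """Z-score a trajectory (an assumed stand-in for TraceBias normalization)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


# Three consecutive user replies in one conversation.
feedback_turns = [
    "That is wrong and confusing.",
    "Okay, a bit better.",
    "Perfect, thanks!",
]
trajectory = [satisfaction_score(t) for t in feedback_turns]
normalized = normalize_trajectory(trajectory)
```

In contrast to a single binary chosen/rejected label, the trajectory keeps one graded score per turn, which is the kind of signal a score-based optimizer could consume.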
