Sharpe Ratio-Guided Active Learning for Preference Optimization in RLHF
Syrine Belakaria, Joshua Kazdan, Charles Marx, Chris Cundy, Willie Neiswanger, Sanmi Koyejo, Barbara E. Engelhardt, Stefano Ermon
TL;DR
This work tackles the high cost of RLHF data labeling by introducing Sharpe-ratio–based active learning for Direct Preference Optimization (DPO). It derives a closed-form, per-tuple Sharpe ratio from the two possible gradient updates, one for each preference outcome, that a prompt and its pair of candidate responses can induce, and proposes two instantiations, SHARP and W-SHARP, to select the most informative data under a budget constraint. The acquisition function balances expected gradient impact against risk, and its closed form allows memory-efficient computation, reducing the labeling burden. Empirical results on the Helpful-Harmless and Stanford Human Preferences datasets across multiple model sizes show up to a 5% improvement in win rate over the DPO baseline with limited annotations, demonstrating improved data efficiency in RLHF alignment.
Abstract
Reinforcement learning from human feedback (RLHF) has become a cornerstone of the training and alignment pipeline for large language models (LLMs). Recent advances, such as direct preference optimization (DPO), have simplified the preference learning step. However, collecting preference data remains challenging and costly, often requiring expert annotation. This cost can be mitigated by carefully selecting the data points presented for annotation. In this work, we propose an active learning approach that efficiently selects prompt and preference pairs using a risk assessment strategy based on the Sharpe ratio. Because preferences are unknown before annotation, our method evaluates the gradients under every possible preference annotation to assess their impact on model updates. These gradient-based evaluations enable risk assessment of data points regardless of the annotation outcome. Leveraging the derivation of the DPO loss, we obtain a closed-form expression for computing these Sharpe ratios on a per-tuple basis, so the approach remains both tractable and computationally efficient. We also introduce two variants of our method, each making different assumptions about prior information. Experimental results demonstrate that our method outperforms the baseline by up to 5% in win rate against the chosen completion with limited human preference data across several language models and real-world datasets.
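To make the idea of scoring a tuple before its label is known more concrete, the following is a minimal sketch and not the closed-form expression derived in the paper. It assumes that the size of each possible update can be summarized by the scalar weight that multiplies the DPO gradient, β·σ(−β·h) if the first response is preferred and β·σ(β·h) if the second is, where h is the policy-versus-reference log-ratio margin between the two responses. It also assumes a fixed prior probability `p_first` for the first outcome. The function names, `margin`, `p_first`, and `eps` are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def dpo_update_scales(margin, beta=0.1):
    """Scalar weights of the DPO gradient for the two possible labels.

    `margin` is the (unscaled) log-ratio difference between response 1 and
    response 2. Assumption of this sketch: these weights stand in for the
    gradient magnitudes of the two candidate updates.
    """
    scale_if_first_wins = beta * sigmoid(-beta * margin)
    scale_if_second_wins = beta * sigmoid(beta * margin)
    return scale_if_first_wins, scale_if_second_wins


def two_outcome_sharpe(g_first, g_second, p_first=0.5, eps=1e-8):
    """Mean over standard deviation of a two-outcome random variable."""
    mean = p_first * g_first + (1.0 - p_first) * g_second
    std = math.sqrt(p_first * (1.0 - p_first)) * abs(g_first - g_second)
    # eps avoids division by zero when both outcomes give the same magnitude.
    return mean / (std + eps)


def select_for_annotation(margins, budget, beta=0.1, p_first=0.5):
    """Return indices of the `budget` tuples with the highest Sharpe score."""
    scores = [
        two_outcome_sharpe(*dpo_update_scales(m, beta), p_first=p_first)
        for m in margins
    ]
    ranked = sorted(range(len(margins)), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    # Hypothetical margins for three unlabeled tuples.
    print(select_for_annotation([5.0, 0.0, -20.0], budget=2))
```

In this simplified setting, a tuple scores highly when the expected update is large relative to how much the update size depends on the unknown label. With a uniform prior the expected weight is the same for every tuple, so the ranking is driven entirely by the risk term. The paper's variants, which differ in their assumptions about prior information, may weight the outcomes differently.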
