Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning

Guangxuan Xu, Kai Xu, Shivchander Sudalairaj, Hao Wang, Akash Srivastava

TL;DR

Dr.SoW presents a cost-efficient, domain-flexible approach to preference annotation that uses the log-density ratio between a strongly aligned and a weakly aligned off-the-shelf LLM as a reward signal. The Strong-over-Weak hypothesis, supported by experiments across 221 model pairs, holds that larger human-alignment gaps between the two models yield higher-quality reward signals, enabling effective annotation without human data. The authors implement an end-to-end pipeline that customizes domain-specific reward criteria via instructions and in-context learning, achieving competitive results against SoTA reward classifiers and enabling downstream models like Llama-3-8B-Instruct to reach GPT-4-level performance on certain benchmarks. This work reduces the data and compute overhead of reward modeling while offering adaptable, domain-aware reward functions for safer and more capable AI systems.

Abstract

Preference tuning relies on high-quality human preference data, which is often expensive and time-consuming to gather. In this paper, we introduce Dr.SoW (Density Ratio of Strong over Weak), a cost-effective method that eliminates the reliance on human annotation by leveraging off-the-shelf LLMs for preference data annotation. Dr.SoW uses the log-density ratio between a better-aligned and a less-aligned LLM as a reward signal. We evaluate Dr.SoW across 221 different LLM pairs and empirically find a strong correlation between the performance gap of the paired models and the quality of the reward signal. This insight provides a practical guideline for selecting LLMs for data annotation. Additionally, we introduce an end-to-end pipeline that customizes reward functions based on user query domains. Without fine-tuning, it improves accuracy on domain-specific evaluations. With a pair of Mistral-7B models, Dr.SoW achieves a RewardBench score of 82.6, outperforming the best trained reward functions from the same model class and demonstrating competitive performance against SoTA models in the Safety (91.0) and Reasoning (88.0) domains. Further, we preference-tune Llama-3-8B-Instruct using data annotated by Dr.SoW. Our approach pushes Llama-3-8B to achieve a 37.4% (+15.1%) win rate on ArenaHard and a 40.7% (+17.8%) win rate on length-controlled AlpacaEval 2.0.
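As a rough illustration of the reward described in the abstract, the sketch below scores a response by the difference between its log-probability under a better-aligned model and under a less-aligned model, then labels the higher-scoring response as preferred. The function names and the per-token log-probabilities are hypothetical placeholders, not taken from the paper; in practice the log-probabilities would come from scoring each response with the two LLMs.

```python
from typing import Sequence, Tuple


def sequence_logprob(token_logprobs: Sequence[float]) -> float:
    """Sum per-token log-probabilities to obtain log pi(y | x)."""
    return sum(token_logprobs)


def density_ratio_reward(
    strong_token_logprobs: Sequence[float],
    weak_token_logprobs: Sequence[float],
) -> float:
    """Log-density ratio: log pi_strong(y | x) - log pi_weak(y | x)."""
    return sequence_logprob(strong_token_logprobs) - sequence_logprob(
        weak_token_logprobs
    )


def annotate_pair(reward_a: float, reward_b: float) -> Tuple[str, str]:
    """Return (chosen, rejected) labels for two candidate responses."""
    return ("a", "b") if reward_a >= reward_b else ("b", "a")


# Hypothetical per-token log-probabilities for two responses to one prompt.
resp_a_strong = [-0.2, -0.5, -0.1]  # response A scored by the strong model
resp_a_weak = [-0.9, -1.1, -0.6]    # response A scored by the weak model
resp_b_strong = [-1.0, -1.4, -0.8]  # response B scored by the strong model
resp_b_weak = [-0.7, -0.9, -0.5]    # response B scored by the weak model

reward_a = density_ratio_reward(resp_a_strong, resp_a_weak)
reward_b = density_ratio_reward(resp_b_strong, resp_b_weak)
chosen, rejected = annotate_pair(reward_a, reward_b)
print(f"reward_a={reward_a:.2f}, reward_b={reward_b:.2f}, chosen={chosen}")
```

Here response A is preferred because the strong model assigns it much more probability than the weak model does, while the opposite holds for response B.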

Paper Structure

This paper contains 41 sections, 8 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: We analyze how different model pairs $(\pi_{\text{strong}}, \pi_{\text{weak}})$ impact the quality of the reward signal provided by Equation (\ref{eq:weak_strong_ratio}). Each point represents one of 221 unique model pairs: 100 Llama-8B pairs (green) and 121 Mistral-7B pairs (blue). The x-axis denotes the alignment gap between $\pi_{\text{strong}}$ and $\pi_{\text{weak}}$, measured by ArenaHard scores, while the y-axis represents reward signal quality, measured by RewardBench scores. We observe a strong correlation between model alignment gap and reward signal quality, indicating that practitioners should pair a well-aligned $\pi_{\text{strong}}$ with a less-aligned $\pi_{\text{weak}}$ when using Equation (\ref{eq:weak_strong_ratio}) as a reward signal.
  • Figure 2: Density ratio rewards from different pairing combinations, with the numerator model on the y-axis and the denominator model on the x-axis. The five models chosen in each model family are sorted by their human-alignment level, measured by ArenaHard. According to DPO implicit reward theory, the cells along the diagonal (red-outlined), which pair a model before and after DPO training, should yield optimal rewards. However, empirical results indicate that using the Base model as the denominator consistently yields higher scores (green-outlined cells), motivating our strong-over-weak density ratio reward function.
  • Figure 3: Instruction with detailed criterion to define preference in Safety domain. This prompt outlines key principles to ensure constructive, empathetic, and safe responses.
  • Figure 4: Density ratio rewards from various numerator and denominator model pairings, following Equation (\ref{eq:weak_strong_ratio}). Models, fine-tuned with different objectives, are ordered by their human-alignment levels, measured by ArenaHard. Generally, larger alignment gaps between numerator and denominator models yield stronger reward functions, supporting the "Strong-over-Weak Hypothesis" in our reward design. This trend holds across models fine-tuned with distinct objectives. An exception is Instruct(PPO), an official Meta instruct model, which achieves a strong ArenaHard score likely due to more intensive SFT training rather than improved human alignment.
  • Figure 5: Few-shot Instruction template to guide rewards.
  • ...and 8 more figures
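
The guideline in the Figure 1 caption (pair a well-aligned model with a less-aligned one) can be sketched as a simple selection over candidate checkpoints. This is only an illustrative heuristic, not the authors' selection procedure: the checkpoint names and ArenaHard scores below are made-up placeholders, and `best_pair` is a hypothetical helper.

```python
from itertools import permutations
from typing import Dict, Tuple

# Hypothetical ArenaHard scores for checkpoints of one model family.
arena_hard: Dict[str, float] = {"base": 2.0, "sft": 8.0, "dpo": 15.0}


def best_pair(scores: Dict[str, float]) -> Tuple[str, str]:
    """Return the (strong, weak) ordered pair with the largest alignment gap."""
    return max(
        permutations(scores, 2),
        key=lambda pair: scores[pair[0]] - scores[pair[1]],
    )


strong, weak = best_pair(arena_hard)
print(f"numerator (strong): {strong}, denominator (weak): {weak}")
```

With these placeholder scores the largest gap is between the DPO-tuned checkpoint and the base checkpoint, which matches the Figure 2 observation that using the Base model as the denominator tends to score higher.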