
Improve LLM-as-a-Judge Ability as a General Ability

Jiachen Yu, Shaoning Sun, Xiaohui Hu, Jiaxu Yan, Kaidong Yu, Xuelong Li

TL;DR

This work reframes LLM-based judging as a general capability and introduces a data-efficient two-stage training framework (SFT warm-up followed by DPO enhancement) combined with a targeted data-synthesis pipeline. The approach achieves state-of-the-art results on RewardBench with only a fraction of the data used by prior methods and demonstrates strong general-ability performance on multiple benchmarks, including AlignBench and MT-Bench. The authors also validate downstream benefits by showing improved policy-model optimization using judge-provided signals, and they open-source model weights and data to promote reproducibility and further research. Overall, the study shows that judicious data generation and staged training can jointly improve judge quality and broader model capabilities while reducing data and compute requirements.

Abstract

LLM-as-a-Judge leverages the generative and reasoning capabilities of large language models (LLMs) to evaluate LLM responses across diverse scenarios, providing accurate preference signals. This approach plays a vital role in aligning LLMs with human values, helping ensure ethical and reliable AI outputs that conform to societal norms. Recent studies have proposed many methods for training LLMs as generative judges, but most of them are data-intensive or lack accuracy, and they focus only on the LLM's judge ability. In this work, we regard judge ability as a general ability of LLMs and implement a two-stage training approach, comprising a supervised fine-tuning (SFT) warm-up and a direct preference optimization (DPO) enhancement, to achieve judge-style adaptation and improve judgment accuracy. Additionally, we introduce an efficient data synthesis method for generating judgment content. Experimental results demonstrate that our approach, using only about 2% to 40% of the data required by other methods, achieves SOTA performance on RewardBench. Furthermore, our training method enhances the general capabilities of the model by constructing complicated judge tasks. In our tests on optimizing a policy model with the judge model, the judge signals provided by our model significantly improved the downstream DPO training performance of our internal models. We also open-source our model weights and training data to facilitate further research.
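
For orientation, the DPO enhancement stage can be read against the standard DPO objective. The form below is the widely used general formulation, written with the symbols from the Figure 2 caption ($inst$ for the judge instruction, $j_c$ and $j_r$ for the chosen and rejected judge answers). The policy $\pi_\theta$, the frozen reference policy $\pi_{\mathrm{ref}}$, and the temperature $\beta$ are notation assumed here; the paper's exact formulation may differ.

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(inst,\, j_c,\, j_r)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(j_c \mid inst)}{\pi_{\mathrm{ref}}(j_c \mid inst)} - \beta \log \frac{\pi_\theta(j_r \mid inst)}{\pi_{\mathrm{ref}}(j_r \mid inst)}\right)\right]
$$

Under this reading, the SFT warm-up would supply the initial policy and the reference $\pi_{\mathrm{ref}}$, and minimizing the loss raises the relative likelihood of $j_c$ over $j_r$ for the same instruction.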

Paper Structure

This paper contains 32 sections, 4 equations, 11 figures, and 5 tables.

Figures (11)

  • Figure 1: Our model achieves both strong general abilities and judge abilities using only a minimal amount of data.
  • Figure 2: Data synthesis and model training pipeline. The pipeline consists of 4 sequential stages. $q$: the question in the preference dataset. $a_c$, $a_r$: the chosen and rejected answers to $q$ in the preference dataset. $inst$: the judge instruction with $q$, $a_c$, and $a_r$ merged in. $j_{CoT}$: the reasoning process behind a judgment. $j_{res}$: the judge result for the judge instruction. $j_c$, $j_r$: the chosen and rejected judge answers in the DPO training process.
  • Figure 3: Results of the ablation test on training stages. Both the SFT and DPO stages have positive effects, with DPO contributing more. Compared with the traditional SFT training of general models, pre-learning the format of judge tasks during the SFT phase enables the model to achieve superior results in the DPO stage.
  • Figure 4: Results of the ablation test on data amount. As the data amount increases in each of the two stages, the model's metrics trend upward, peaking at around 20k SFT data and 20k DPO data.
  • Figure 5: Evaluation results with different prompts. The model's metrics vary minimally across prompts, indicating that our data synthesis strategy improves the model's adaptability to diverse prompts.
  • ...and 6 more figures