Improve LLM-as-a-Judge Ability as a General Ability
Jiachen Yu, Shaoning Sun, Xiaohui Hu, Jiaxu Yan, Kaidong Yu, Xuelong Li
TL;DR
This work reframes LLM-based judging as a general capability and introduces a data-efficient two-stage training framework (SFT warm-up followed by DPO enhancement) combined with a targeted data-synthesis pipeline. The approach achieves state-of-the-art results on RewardBench with only a fraction of the data used by prior methods and demonstrates strong general-ability performance on multiple benchmarks, including AlignBench and MT-Bench. The authors also validate downstream benefits by showing improved policy-model optimization using judge-provided signals, and they open-source model weights and data to promote reproducibility and further research. Overall, the study shows that judicious data generation and staged training can jointly improve judge quality and broader model capabilities while reducing data and compute requirements.
Abstract
LLM-as-a-Judge leverages the generative and reasoning capabilities of large language models (LLMs) to evaluate LLM responses across diverse scenarios, providing accurate preference signals. This approach plays a vital role in aligning LLMs with human values, helping ensure ethical and reliable AI outputs that are consistent with societal norms. Recent studies have proposed many methods for training LLMs as generative judges, but most of them are data-intensive or lack accuracy, and they focus only on the LLM's judge ability. In this work, we regard judge ability as a general ability of LLMs and implement a two-stage training approach, comprising a supervised fine-tuning (SFT) warm-up and a direct preference optimization (DPO) enhancement, to achieve judge-style adaptation and improve judgment accuracy. Additionally, we introduce an efficient data synthesis method to generate judgment content. Experimental results demonstrate that our approach, using only about 2% to 40% of the data required by other methods, achieves state-of-the-art (SOTA) performance on RewardBench. Furthermore, our training method enhances the general capabilities of the model by constructing complicated judge tasks, and in our tests of optimizing a policy model with the judge model, the judge signals provided by our model significantly improved the downstream DPO training performance of our internal models. We also open-source our model weights and training data to facilitate further research.
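
As a rough illustration of the DPO enhancement stage, the sketch below computes the standard DPO loss for a single preference pair in plain Python. It is a minimal sketch under stated assumptions, not the training code used in this work: the function name `dpo_loss`, the sequence-level log-probability inputs, and the default `beta = 0.1` are illustrative choices that do not come from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under the
    policy being trained and under a frozen reference model (assumed here
    to be the SFT warm-up checkpoint). `beta` is a hypothetical default.
    """
    # How much more likely each response became relative to the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled margin between the chosen and rejected log-ratios.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls below that value as the policy raises the chosen judgment relative to the rejected one, and rises above it in the opposite case.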
