Auto-Prompt Ensemble for LLM Judge
Jiajie Li, Huayi Zhang, Peng Lin, Jinjun Xiong, Wei Xu
TL;DR
This work tackles the misalignment between human judgments and LLM-based evaluators caused by missing evaluation dimensions. It introduces Auto-Prompt Ensemble (APE), which automatically discovers task-specific evaluation dimensions from failure cases and uses a Collective Confidence ensemble to decide when to override initial judgments. Empirical results on Skywork Reward Preference and Reward Bench demonstrate consistent gains in agreement with human judgments, including strong zero-shot transfer across models. The approach enables reliable, test-time augmentation of LLM judges, offering a scalable and transferable method to bridge the human–machine evaluation gap.
Abstract
We present a novel framework that improves the reliability of LLM judges by selectively augmenting the judge with auxiliary evaluation dimensions. Existing LLM judges often miss crucial evaluation dimensions because they fail to recognize the implicit standards underlying human assessments. To address this challenge, we propose the Auto-Prompt Ensemble (APE), an adaptive framework that automatically learns evaluation dimensions from the judge's own failure cases. APE uses a confidence-based ensemble mechanism, built on a novel confidence estimation approach called Collective Confidence, to decide when to adopt the judgments produced under the additional evaluation dimensions. Extensive experiments demonstrate that APE improves the reliability of LLM judges across diverse standard benchmarks. For instance, APE raises GPT-4o's agreement rate with human judgments on Reward Bench from 87.2% to 90.5% in the zero-shot setting. Overall, APE provides a principled way for LLM judges to leverage test-time computation and bridge the evaluation gap between human and LLM judges.
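The confidence-gated override described above can be illustrated with a minimal sketch. The abstract does not define Collective Confidence, so this sketch assumes it can be approximated by the share of auxiliary-dimension judges that agree on a verdict; the function names, the vote-share definition, and the 0.8 threshold are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Sequence, Tuple


def collective_confidence(votes: Sequence[str]) -> Tuple[str, float]:
    """Return the majority verdict among auxiliary votes and its vote share.

    Assumption: confidence is modeled as simple agreement among the
    auxiliary-dimension judges; the paper's estimator may differ.
    """
    if not votes:
        raise ValueError("need at least one auxiliary vote")
    verdict, count = Counter(votes).most_common(1)[0]
    return verdict, count / len(votes)


def ensemble_judgment(initial: str, votes: Sequence[str],
                      threshold: float = 0.8) -> str:
    """Keep the initial verdict unless the auxiliary judges agree
    strongly (vote share >= threshold) on a different one.

    The threshold value is a hypothetical default, not a reported setting.
    """
    if not votes:
        return initial
    verdict, confidence = collective_confidence(votes)
    if verdict != initial and confidence >= threshold:
        return verdict
    return initial


if __name__ == "__main__":
    # Initial judge prefers response "A"; five auxiliary dimensions vote.
    print(ensemble_judgment("A", ["B", "B", "B", "B", "B"]))
    print(ensemble_judgment("A", ["B", "B", "A", "A", "B"]))
```

Under these assumptions, the initial verdict is overridden only when the auxiliary judges agree strongly on a different answer, which mirrors the selective use of the additional dimensions rather than applying them unconditionally.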
