Auto-Prompt Ensemble for LLM Judge

Jiajie Li, Huayi Zhang, Peng Lin, Jinjun Xiong, Wei Xu

TL;DR

This work tackles the misalignment between human judgments and LLM-based evaluators caused by missing evaluation dimensions. It introduces Auto-Prompt Ensemble (APE), which automatically discovers task-specific evaluation dimensions from failure cases and uses a Collective Confidence ensemble to decide when to override initial judgments. Empirical results on Skywork Reward Preference and Reward Bench demonstrate consistent gains in agreement with human judgments, including strong zero-shot transfer across models. The approach enables reliable, test-time augmentation of LLM judges, offering a scalable and transferable method to bridge the human–machine evaluation gap.

Abstract

We present a novel framework that improves the reliability of LLM judges by selectively augmenting the LLM with auxiliary evaluation dimensions. Existing LLM judges often miss crucial evaluation dimensions because they fail to recognize the implicit standards underlying human assessments. To address this challenge, we propose the Auto-Prompt Ensemble (APE), an adaptive framework that automatically learns evaluation dimensions from the judge's own failure cases. APE incorporates a confidence-based ensemble mechanism that decides when to adopt the judgments from the additional evaluation dimensions, using a novel confidence estimation approach called Collective Confidence. Extensive experiments demonstrate that APE improves the reliability of LLM judges across diverse standard benchmarks. For instance, APE raises GPT-4o's agreement rate on Reward Bench from 87.2% to 90.5% in the zero-shot setting. Overall, APE provides a principled approach for LLM judges to leverage test-time computation and bridge the evaluation gap between human and LLM judges.
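
The override rule described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formula for Collective Confidence, so the code below assumes a simple stand-in: the fraction of dimension-specific judgments that agree on one verdict. The function name `ensemble_verdict`, the labels "A" and "B", and the threshold value of 0.8 are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def ensemble_verdict(initial, dimension_votes, threshold=0.8):
    """Return the final preference ("A" or "B") for one comparison.

    initial: verdict of the base LLM judge.
    dimension_votes: verdicts produced under each auxiliary evaluation dimension.
    threshold: assumed cutoff for overriding the initial verdict (hypothetical).
    """
    if not dimension_votes:
        # No auxiliary judgments available, so keep the base verdict.
        return initial

    # Assumed stand-in for Collective Confidence: share of votes for the
    # most common verdict among the auxiliary dimensions.
    winner, count = Counter(dimension_votes).most_common(1)[0]
    confidence = count / len(dimension_votes)

    # Override only when the dimensions disagree with the base judge
    # and do so with sufficiently high agreement among themselves.
    if winner != initial and confidence >= threshold:
        return winner
    return initial


if __name__ == "__main__":
    print(ensemble_verdict("A", ["B", "B", "B", "B", "A"]))
    print(ensemble_verdict("A", ["B", "A", "B"]))
```

In this sketch the base verdict is the default, and the auxiliary dimensions act only as a correction when they agree strongly, which mirrors the "selective" augmentation the abstract describes.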

Paper Structure

This paper contains 27 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of the APE framework. In the top pipeline, evaluation dimensions are automatically discovered by identifying failure cases and proposing targeted rubrics to correct them. In the bottom pipeline, a confidence-based ensemble aggregates judgments across verified dimensions, overriding the initial decision only when the collective confidence is sufficiently high.
  • Figure 2: Reliability plot for confidence estimation methods, using GPT-4o with CoT as the judge on Reward Bench. A deeper color indicates a higher percentage. The dashed diagonal line represents perfect calibration, where estimated confidence matches actual agreement.
  • Figure 3: An example of a failure case from the Skywork Reward Preference dataset, where GPT-4o incorrectly prefers a suboptimal response. On the right, each cell indicates whether a newly generated evaluation dimension addresses the corresponding failure case: black denotes success, while white denotes failure.
  • Figure 4: Impact of Number of Dimensions
  • Figure 5: Prompt used for LLM Judge inference.
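
The reliability plot in Figure 2 compares estimated confidence with actual agreement. As a rough illustration of how such a plot is typically built (not the authors' code), the sketch below groups judgments into equal-width confidence bins and reports, per bin, the mean confidence and the fraction of judgments that agree with the human label. The function name `reliability_bins` and the bin count are assumptions.

```python
def reliability_bins(confidences, correct, n_bins=5):
    """Summarize calibration as (mean confidence, agreement rate, count) per bin.

    confidences: estimated confidence in [0, 1] for each judgment.
    correct: True where the judgment agrees with the human label.
    n_bins: number of equal-width bins (an assumed default).
    """
    totals = [0] * n_bins
    hits = [0] * n_bins
    conf_sums = [0.0] * n_bins

    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        totals[idx] += 1
        hits[idx] += int(ok)
        conf_sums[idx] += conf

    # Skip empty bins. A perfectly calibrated estimator has
    # mean confidence equal to agreement rate in every bin.
    return [
        (conf_sums[i] / totals[i], hits[i] / totals[i], totals[i])
        for i in range(n_bins)
        if totals[i]
    ]


if __name__ == "__main__":
    for mean_conf, agreement, count in reliability_bins(
        [0.9, 0.95, 0.1], [True, False, False]
    ):
        print(f"{mean_conf:.3f} {agreement:.3f} {count}")
```

Points on the dashed diagonal of such a plot correspond to bins where the first two values coincide.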