Language Model Preference Evaluation with Multiple Weak Evaluators

Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Kaize Ding, Ranjay Krishna

TL;DR

PGED introduces a modular framework to address cycles and noise in LLM preference evaluations by ensembling multiple weak evaluators into a collective preference graph and applying denoising to yield a directed acyclic graph. The approach combines graph ensemble with a weighted feedback-arc-set–based denoising and a graph-to-ranking step to produce robust per-question and global rankings for response selection, model ranking, and data selection. The authors provide theoretical recovery guarantees under a perturbation model and demonstrate strong empirical gains across ten benchmarks, including cases where small-model ensembles outperform single large evaluators. The work highlights the value of weak-signal aggregation and graph-based denoising for scalable, reliable evaluation in LLM research and deployment.
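The three stages named above (ensemble, denoise, graph-to-ranking) can be sketched in a few lines. This is a minimal illustration, not the authors' algorithm: it assumes each evaluator's graph is a dict mapping a hypothetical `(winner, loser)` pair to a weight, ensembles by summing weights, and substitutes a simple greedy ordering heuristic for the weighted feedback-arc-set denoising. The function names `ensemble`, `rank_nodes`, and `denoise` are invented for this example.

```python
from collections import defaultdict

def ensemble(graphs):
    """Sum edge weights across evaluator graphs (assumed aggregation rule)."""
    total = defaultdict(float)
    for g in graphs:
        for edge, w in g.items():
            total[edge] += w
    return dict(total)

def rank_nodes(graph):
    """Order nodes by weighted out-degree minus in-degree, ties broken by name."""
    score = defaultdict(float)
    for (u, v), w in graph.items():
        score[u] += w
        score[v] -= w
    return sorted(score, key=lambda n: (-score[n], n))

def denoise(graph):
    """Greedy stand-in for feedback-arc-set removal: drop edges that point
    backwards with respect to the score-based node order. Every surviving
    edge goes forward in one total order, so the result is acyclic."""
    pos = {n: i for i, n in enumerate(rank_nodes(graph))}
    return {(u, v): w for (u, v), w in graph.items() if pos[u] < pos[v]}

# Three hypothetical evaluators; the first one produces a cycle A > B > C > A.
graphs = [
    {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "A"): 1.0},
    {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 1.0},
    {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 1.0},
]
merged = ensemble(graphs)
dag = denoise(merged)
ranking = rank_nodes(dag)
print(ranking)
```

In this toy input the lone `C > A` vote is outweighed by the other evaluators, so the heuristic removes that edge and the remaining graph yields a consistent order. An exact minimum feedback arc set is NP-hard in general, which is why the sketch uses a heuristic.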

Abstract

Despite the remarkable success of Large Language Models (LLMs), evaluating the quality of their outputs with respect to preference remains a critical challenge. Existing works usually use a strong LLM as a judge to compare LLMs' responses pairwise, but such a single-evaluator approach is vulnerable to cyclic preference, i.e., output A is better than B, B is better than C, but C is better than A, which produces contradictory evaluation results. To address this, we introduce PGED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs to obtain acyclic, non-contradictory evaluation results. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure. Extensive experiments on ten benchmarks demonstrate PGED's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning. Notably, PGED combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to outperform strong ones (e.g., Qwen2-72B), showcasing its effectiveness in enhancing evaluation reliability and improving model performance.
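To make the notion of cyclic preference concrete, the following sketch checks whether a set of pairwise judgments contains a directed cycle. It assumes judgments are given as `(better, worse)` tuples; the helper name `find_cycle` is hypothetical and the depth-first search is a standard technique, not something taken from the paper.

```python
def find_cycle(edges):
    """Return one directed cycle as a list of nodes, or None if acyclic."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current DFS path, finished
    color = {n: WHITE for n in adj}
    stack = []  # nodes on the current DFS path, in visit order

    def dfs(u):
        color[u] = GRAY
        stack.append(u)
        for v in adj[u]:
            if color[v] == GRAY:
                # v is already on the current path, so the path closes a cycle.
                return stack[stack.index(v):] + [v]
            if color[v] == WHITE:
                found = dfs(v)
                if found:
                    return found
        stack.pop()
        color[u] = BLACK
        return None

    for n in adj:
        if color[n] == WHITE:
            found = dfs(n)
            if found:
                return found
    return None

cyclic = [("A", "B"), ("B", "C"), ("C", "A")]      # A > B > C > A
acyclic = [("A", "B"), ("B", "C"), ("A", "C")]     # consistent ordering
print(find_cycle(cyclic))
print(find_cycle(acyclic))
```

A non-`None` result means no ranking can satisfy every judgment at once, which is the contradiction the abstract describes.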

Paper Structure

This paper contains 66 sections, 7 theorems, 27 equations, 7 figures, 18 tables, 4 algorithms.

Key Result

Theorem 1

Suppose $G_1, \dots, G_N\overset{\text{i.i.d.}}{\sim}\mathcal{G}(G, \delta_1, \delta_2)$ for some ground truth $G=(V, A)$. Let $\widehat{G}$ be the graph ensembled from $G_1, \dots, G_N$ by the operations defined in the paper's section on basic operations. Then, as long as $\delta_1=0.5-\epsilon$ for some $\epsilon$ [...] where $G\subseteq\text{MAS}(\widehat{G})$ represents that $G$ is a subgraph of $\text{MAS}(\widehat{G})$ [...]
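The theorem statement is truncated here and the definition of $\mathcal{G}(G, \delta_1, \delta_2)$ is not shown, so the following is only an intuition aid under a stated assumption: that $\delta_1$ is the probability an individual evaluator reverses a true edge, independently across evaluators. Under that assumption, the sketch computes the exact probability that a strict majority of $N$ evaluators reports the true direction of a single edge. It is not the theorem's bound or proof, and `majority_correct_prob` is a hypothetical helper.

```python
from math import comb

def majority_correct_prob(n_evaluators, flip_prob):
    """Probability that a strict majority of independent evaluators
    report the true edge direction, given a per-evaluator flip probability."""
    p_ok = 1.0 - flip_prob
    need = n_evaluators // 2 + 1  # smallest count that is a strict majority
    return sum(
        comb(n_evaluators, k) * p_ok**k * flip_prob**(n_evaluators - k)
        for k in range(need, n_evaluators + 1)
    )

# Assumed flip probability 0.4, i.e. delta_1 = 0.5 - epsilon with epsilon = 0.1.
probs = [majority_correct_prob(n, 0.4) for n in (1, 3, 5, 11, 51)]
print(probs)
```

With a flip probability below 0.5, the majority becomes more reliable as odd $N$ grows, which is consistent with the requirement $\delta_1 = 0.5 - \epsilon$ in the statement above.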

Figures (7)

  • Figure 1: (a) A preference graph exhibiting cyclic inconsistencies (e.g., A $\succ$ B $\succ$ C $\succ$ A), which violate transitivity. (b) Empirical results showing that even advanced LLMs (e.g., GPT-4-o) exhibit significant noise in preference judgments, leading to inconsistent evaluations. (c) Overview of our proposed framework, PGED, which ensembles multiple preference evaluators and applies denoising to recover a directed acyclic graph.
  • Figure 2: Comparison of PGED with GPT-3.5, GPT-4-o-mini, and GPT-4-o on 100 randomly selected tasks. PGED consistently outperforms GPT-3.5 across all tasks and surpasses GPT-4-o-mini on challenging tasks like HumanEval and GSM8k, showcasing the effectiveness of weak evaluator aggregation with graph denoising.
  • Figure 3: Performance comparison of different methods (Random, Longest, ContraSolver, and PGED) across multiple benchmarks. The results show PGED effectively filters low-quality responses, improving performance and model alignment over baselines.
  • Figure 4: Case studies showcasing the raw and denoised preference graphs.
  • Figure 5: Comparison of PGED and its (w/o ensemble) variants. PGED outperforms these variants because directly ensembling preference graphs preserves more information, whereas the rank aggregation used by the (w/o ensemble) methods leads to performance loss.
  • ...and 2 more figures

Theorems & Definitions (14)

  • Theorem 1
  • Theorem 1
  • Proof J.1
  • Lemma 1
  • Proof J.2
  • Proposition 1: Acyclicity
  • Proof M.1
  • Lemma 2: Range and update magnitude
  • Proof M.2
  • Lemma 3: Bucket maintenance in $\mathcal{O}(1)$ amortized time
  • ...and 4 more