Language Model Preference Evaluation with Multiple Weak Evaluators
Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Kaize Ding, Ranjay Krishna
TL;DR
PGED introduces a modular framework to address cycles and noise in LLM preference evaluations by ensembling multiple weak evaluators into a collective preference graph and applying denoising to yield a directed acyclic graph. The approach combines graph ensemble with a weighted feedback-arc-set–based denoising and a graph-to-ranking step to produce robust per-question and global rankings for response selection, model ranking, and data selection. The authors provide theoretical recovery guarantees under a perturbation model and demonstrate strong empirical gains across ten benchmarks, including cases where small-model ensembles outperform single large evaluators. The work highlights the value of weak-signal aggregation and graph-based denoising for scalable, reliable evaluation in LLM research and deployment.
Abstract
Despite the remarkable success of Large Language Models (LLMs), evaluating the quality of their outputs with respect to preference remains a critical challenge. Existing works usually rely on a single strong LLM as the judge for pairwise comparison of LLM responses, but such a single-evaluator approach is vulnerable to cyclic preferences, i.e., output A is judged better than B, B better than C, but C better than A, which yields contradictory evaluation results. To address this, we introduce PGED (Preference Graph Ensemble and Denoise), a novel approach that leverages multiple model-based evaluators to construct preference graphs, and then ensembles and denoises these graphs to obtain acyclic, non-contradictory evaluation results. We provide theoretical guarantees for our framework, demonstrating its efficacy in recovering the ground truth preference structure. Extensive experiments on ten benchmarks demonstrate PGED's superiority in three applications: 1) model ranking for evaluation, 2) response selection for test-time scaling, and 3) data selection for model fine-tuning. Notably, PGED combines small LLM evaluators (e.g., Llama3-8B, Mistral-7B, Qwen2-7B) to outperform strong ones (e.g., Qwen2-72B), showcasing its effectiveness in enhancing evaluation reliability and improving model performance.
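The ensemble-then-denoise idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `ensemble` and `denoise`, the edge-weight encoding, and the toy judgments are assumptions made for illustration. Each evaluator's graph is assumed to be a dictionary mapping an edge `(u, v)`, meaning "u is preferred over v", to a weight. Denoising is done here by exhaustive search over orderings for a minimum-weight feedback arc set, which is only feasible for a handful of responses; an approximate method would be needed at realistic scale.

```python
from collections import defaultdict
from itertools import permutations


def ensemble(graphs):
    """Merge per-evaluator preference graphs by summing edge weights."""
    merged = defaultdict(float)
    for graph in graphs:
        for edge, weight in graph.items():
            merged[edge] += weight
    return dict(merged)


def denoise(graph):
    """Drop a minimum-weight set of edges so the graph becomes acyclic.

    Brute force (an assumption for this sketch): try every ordering of the
    nodes and keep the one whose backward edges have the least total weight.
    """
    nodes = sorted({node for edge in graph for node in edge})
    best_order, best_cost = None, float("inf")
    for order in permutations(nodes):
        pos = {node: i for i, node in enumerate(order)}
        cost = sum(w for (u, v), w in graph.items() if pos[u] > pos[v])
        if cost < best_cost:
            best_order, best_cost = order, cost
    pos = {node: i for i, node in enumerate(best_order)}
    # Keep only edges that agree with the best ordering; the result is a DAG.
    dag = {(u, v): w for (u, v), w in graph.items() if pos[u] < pos[v]}
    return dag, list(best_order)


# Three hypothetical weak evaluators judging responses A, B, C.
evaluator_1 = {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "A"): 1.0}  # cyclic
evaluator_2 = {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 1.0}
evaluator_3 = {("A", "B"): 1.0, ("C", "B"): 1.0, ("A", "C"): 1.0}

dag, ranking = denoise(ensemble([evaluator_1, evaluator_2, evaluator_3]))
print(ranking)  # ['A', 'B', 'C']
print(dag)
```

In this toy case the first evaluator alone is cyclic, yet the summed graph has a unique cheapest ordering, so the two minority edges (C over A, C over B) are removed and the remaining edges form an acyclic graph consistent with the ranking A, B, C.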
