PiCO: Peer Review in LLMs based on the Consistency Optimization
Kun-Peng Ning, Shuo Yang, Yu-Yang Liu, Jia-Yu Yao, Zhen-Hui Liu, Yong-Hong Tian, Yibing Song, Li Yuan
TL;DR
This work tackles the challenge of evaluating large language models without human annotations by introducing PiCO, a peer-review-based unsupervised framework. A pool of LLMs answers open-ended questions and reviews each other's answers, producing a dataset $\mathcal{D}$ and scores $G_j$; a consistency-driven optimization over learnable weights $w$ then yields a ranking $\hat{\mathcal{R}}$ that closely matches human preferences $\mathcal{R}^*$. The method includes an unsupervised elimination mechanism that prunes unreliable reviewers, and a validation showing that weighting more capable models improves alignment and reduces bias as measured by the preference gap. Across MT-Bench, Chatbot Arena, and AlpacaEval, PiCO outperforms baselines on rank-based metrics ($S$, $\tau$, $H$), with token usage comparable to the baselines and strong stability across random seeds. This annotation-free approach offers scalable, bias-resistant LLM evaluation and can be extended to multi-modal models.
Abstract
Existing evaluation methods for large language models (LLMs) typically test performance on closed-environment, domain-specific benchmarks with human annotations. In this paper, we explore a novel unsupervised evaluation direction that uses peer-review mechanisms to measure LLMs automatically. In this setting, open-source and closed-source LLMs lie in the same environment, where they answer unlabeled questions and evaluate each other, and each LLM's response score is jointly determined by the other, anonymous LLMs. To obtain the ability hierarchy among these models, we assign each LLM a learnable capability parameter that adjusts the final ranking. We formalize this as a constrained optimization problem that maximizes the consistency between each LLM's capability and its score. The key assumption is that a high-level LLM can evaluate others' answers more accurately than a low-level one, and that a higher-level LLM also achieves higher response scores. Moreover, we propose three metrics, called PEN, CIN, and LIS, to measure the gap between the resulting ranking and human rankings. We perform experiments on multiple datasets with these metrics, validating the effectiveness of the proposed approach.
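The consistency idea described above can be illustrated with a small sketch. It is not the paper's optimizer: as a stand-in, it uses a simple fixed-point iteration in which each reviewer's weight is set proportional to that model's own weighted response score, so that capabilities and scores end up mutually consistent. The function name `consistency_ranking`, the matrix layout `review_scores[k, j]` (mean score that reviewer `k` gives to model `j`), and the toy numbers are all assumptions made for illustration.

```python
import numpy as np

def consistency_ranking(review_scores, n_iters=100, tol=1e-9):
    """Toy consistency iteration (illustrative, not the paper's method).

    review_scores[k, j] is assumed to hold the mean non-negative score
    that reviewer k assigns to model j's answers.
    """
    scores = np.asarray(review_scores, dtype=float)
    n_models = scores.shape[0]
    # Start with every reviewer trusted equally.
    w = np.full(n_models, 1.0 / n_models)
    for _ in range(n_iters):
        # Weighted response score G_j = sum_k w_k * scores[k, j].
        g = w @ scores
        # Consistency step: a model's reviewing weight follows its score.
        new_w = g / g.sum()
        converged = np.abs(new_w - w).max() < tol
        w = new_w
        if converged:
            break
    g = w @ scores
    # Rank models from highest to lowest weighted score.
    ranking = np.argsort(-g)
    return w, g, ranking

# Hypothetical 3-model pool; reviewer 2 is the least reliable
# (it prefers model 1 over model 0).
toy_scores = np.array([
    [0.9, 0.6, 0.3],
    [0.8, 0.5, 0.2],
    [0.5, 0.6, 0.4],
])
w, g, ranking = consistency_ranking(toy_scores)
print("weights:", w)
print("ranking:", ranking.tolist())
```

In this toy pool the weaker reviewer ends up with the smallest weight, so its disagreement with the other two has less influence on the final ranking than it would under uniform averaging.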
