PiCO: Peer Review in LLMs based on the Consistency Optimization

Kun-Peng Ning, Shuo Yang, Yu-Yang Liu, Jia-Yu Yao, Zhen-Hui Liu, Yong-Hong Tian, Yibing Song, Li Yuan

TL;DR

This work tackles the challenge of evaluating large language models without human annotations by introducing PiCO, a peer-review-based unsupervised framework. It leverages open-ended questions answered by a pool of LLMs and mutual reviews to generate a dataset $\mathcal{D}$ and scores $G_j$, then uses a consistency-driven optimization over learnable weights $w$ to produce a ranking $\hat{\mathcal{R}}$ that closely matches human preferences $\mathcal{R}^*$. The method includes an unsupervised elimination mechanism to prune unreliable reviewers and a validation showing that weighting more capable models more heavily improves alignment, reducing bias as measured by the preference gap. Across MT-Bench, Chatbot Arena, and AlpacaEval, PiCO outperforms baselines on rank-based metrics ($S$, $\tau$, $H$), with token usage comparable to baselines and strong stability under different seeds. This annotation-free approach offers scalable, bias-resistant LLM evaluation and can be extended to multi-modal models downstream.

Abstract

Existing evaluation methods for large language models (LLMs) typically focus on testing performance on closed-environment, domain-specific benchmarks with human annotations. In this paper, we explore a novel unsupervised evaluation direction, utilizing peer-review mechanisms to measure LLMs automatically. In this setting, both open-source and closed-source LLMs lie in the same environment, capable of answering unlabeled questions and evaluating each other, where each LLM's response score is jointly determined by other anonymous ones. To obtain the ability hierarchy among these models, we assign each LLM a learnable capability parameter to adjust the final ranking. We formalize this as a constrained optimization problem that maximizes the consistency between each LLM's capability and its score. The key assumption is that a higher-level LLM can evaluate others' answers more accurately than a lower-level one, and can also achieve higher response scores. Moreover, we propose three metrics, PEN, CIN, and LIS, to evaluate the gap to human rankings. We perform experiments on multiple datasets with these metrics, validating the effectiveness of the proposed approach.
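
To make the idea of measuring the gap to a human ranking concrete, here is a minimal Python sketch of two rank-gap measures. The abstract names PEN, CIN, and LIS without defining them, so the sketch assumes that CIN counts inversions and that LIS is the length of the longest increasing subsequence; the model names `m1` to `m5` and both rankings are hypothetical, and PEN is left out.

```python
from bisect import bisect_left

def to_positions(predicted, reference):
    """Replace each model in `predicted` by its position in `reference`."""
    rank = {model: k for k, model in enumerate(reference)}
    return [rank[model] for model in predicted]

def count_inversions(seq):
    """Count pairs that appear in the wrong relative order (0 = identical rankings)."""
    n = len(seq)
    return sum(1 for a in range(n) for b in range(a + 1, n) if seq[a] > seq[b])

def longest_increasing_subsequence(seq):
    """Length of the longest strictly increasing subsequence (len(seq) = identical rankings)."""
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for x in seq:
        k = bisect_left(tails, x)
        if k == len(tails):
            tails.append(x)
        else:
            tails[k] = x
    return len(tails)

# Hypothetical human (reference) ranking and a predicted ranking, best model first.
reference = ["m1", "m2", "m3", "m4", "m5"]
predicted = ["m1", "m3", "m2", "m5", "m4"]

positions = to_positions(predicted, reference)       # [0, 2, 1, 4, 3]
inversions = count_inversions(positions)             # 2
lis_length = longest_increasing_subsequence(positions)  # 3
print(positions, inversions, lis_length)
```

Under these assumed definitions, fewer inversions and a longer increasing subsequence both indicate a predicted ranking closer to the reference.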

Paper Structure (21 sections, 19 equations, 12 figures, 5 tables, 1 algorithm)

Figures (12)

  • Figure 1: The framework of PiCO. In this framework, both open-source and closed-source LLMs lie in the same environment, capable of answering unlabeled questions and evaluating each other, where each LLM's response score is jointly determined by other anonymous ones. We assign each LLM a learnable capability weight to optimize the score ranking based on the consistency assumption, while reducing the entropy of the peer-review evaluation system. The goal is to find a final score ranking that all LLMs "agree" on.
  • Figure 2: The PiCO pipeline. It is mainly composed of two components: the peer-review and consistency optimization stages. Specifically, in the peer-review stage, the unlabeled dataset $\mathcal{Q}$ and the LLM pool $\mathcal{M}$ are given. Then, we let all LLMs answer each unlabeled question to obtain the response set $\mathcal{A}$. We shuffle the set and construct anonymous answer pairs, while randomly selecting other LLMs to evaluate both responses with a learnable confidence $w$. As a result, we obtain the answer-ranking data $\mathcal{D}$, each entry of which is a quadruple that records the partial order between two answers and the evaluator's confidence weight. In the consistency optimization stage, we update the parameter $w$ by maximizing the consistency of each LLM's capability and score, while re-ranking the LLMs to be closer to human rankings.
  • Figure 3: Heatmap distribution of the preference gap (PG) metric among seven LLMs across three datasets. Higher values (above 0) indicate greater evaluation bias. The first row shows the original PG values on the three datasets, while the second row displays PG values re-weighted using our learned confidence weights.
  • Figure 4: Performance comparison of the PiCO (Ours) and PRE methods on the Chatbot Arena, MT-Bench, and AlpacaEval datasets, with the number of eliminated reviewers on the x-axis. The y-axis is PEN, where lower values indicate better performance.
  • Figure 5: The average loss for different numbers of eliminated reviewers ($\downarrow$). It shows how the iterative elimination of weaker reviewers affects the overall loss in the peer-review system.
  • ...and 7 more figures
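
The Figure 2 caption describes answer-ranking data $\mathcal{D}$ made of quadruples and a consistency objective over the weights $w$. A minimal Python sketch of those two pieces follows. The tuple layout `(answer_i, answer_j, winner, reviewer)`, the toy reviews, and the use of Pearson correlation as the consistency measure are assumptions made for illustration; the captions do not specify them, and this is not the authors' implementation.

```python
from math import sqrt

# Hypothetical toy data: each quadruple is (answer_i, answer_j, winner, reviewer),
# where every entry is an index into a pool of three models.
REVIEWS = [
    (0, 1, 0, 2), (0, 1, 0, 2), (0, 1, 1, 2),
    (0, 2, 0, 1), (0, 2, 0, 1), (0, 2, 0, 1),
    (1, 2, 1, 0), (1, 2, 1, 0), (1, 2, 2, 0),
]

def weighted_scores(reviews, w):
    """Score of model j: sum of the reviewers' weights over the comparisons j won."""
    scores = [0.0] * len(w)
    for _i, _j, winner, reviewer in reviews:
        scores[winner] += w[reviewer]
    return scores

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def consistency(reviews, w):
    """Assumed consistency measure: correlation between weights and weighted scores."""
    return pearson(w, weighted_scores(reviews, w))

aligned = [0.5, 0.3, 0.2]     # more weight on the models that win more often
reversed_w = [0.2, 0.3, 0.5]  # more weight on the models that win less often

print(weighted_scores(REVIEWS, [1.0, 1.0, 1.0]))  # [5.0, 3.0, 1.0]
print(consistency(REVIEWS, aligned) > consistency(REVIEWS, reversed_w))  # True
```

A consistency optimizer would search over $w$ for the weights that maximize this value; the sketch only compares two fixed candidate weightings.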