PRE: A Peer Review Based Large Language Model Evaluator
Zhumin Chu, Qingyao Ai, Yiteng Tu, Haitao Li, Yiqun Liu
TL;DR
This work introduces PRE, a peer-review-inspired framework for automatic evaluation of large language models that mitigates cost, generalizability, and bias issues common to human- or model-based evaluators. PRE comprises a qualification exam to select reliable reviewer LLMs, a peer review module where these reviewers rate or compare evaluatee outputs, and a chair-style weighted aggregation to produce final rankings. Empirical results on XSum and NF-CATS demonstrate that PRE achieves higher agreement with human judgments than baselines, while revealing biases present when relying on a single evaluator; PRE also shows robustness to parameter changes and potential for unsupervised qualification methods. The framework is generalizable, cost-efficient, and capable of reducing evaluator bias, making it practical for scalable, long-term LLM assessment across tasks.
Abstract
The impressive performance of large language models (LLMs) has attracted considerable attention from the academic and industrial communities. Beyond how to construct and train LLMs, how to effectively evaluate and compare their capabilities is widely recognized as an important yet difficult problem. Existing paradigms rely on either human annotators or model-based evaluators to assess the performance of LLMs on different tasks. However, these paradigms often suffer from high cost, low generalizability, and inherited biases in practice, which makes them incapable of supporting the sustainable development of LLMs in the long term. To address these issues, and inspired by the peer review systems widely used in academic publishing, we propose a novel framework that automatically evaluates LLMs through a peer-review process. Specifically, for a given task, we first construct a small qualification exam to select "reviewers" from several powerful LLMs. Then, to evaluate the "submissions" written by different candidate LLMs, i.e., the evaluatees, we use the reviewer LLMs to rate or compare the submissions. The final ranking of evaluatee LLMs is generated from the results provided by all reviewers. We conducted extensive experiments on text summarization tasks with eleven LLMs, including GPT-4. The results demonstrate that bias exists when evaluation relies on a single LLM. Moreover, our PRE model outperforms all the baselines, illustrating the effectiveness of the peer review mechanism.
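The three stages described above (qualification exam, peer review, aggregation of reviewer results) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the 0.6 pass threshold, the use of exam agreement as the reviewer weight, and all reviewer names and scores are hypothetical values chosen only for illustration.

```python
from collections import defaultdict


def qualify(exam_agreement, threshold=0.6):
    """Keep candidate reviewers whose agreement with human labels on the
    qualification exam meets the threshold (threshold is an assumption)."""
    return {name: acc for name, acc in exam_agreement.items() if acc >= threshold}


def aggregate(ratings, weights):
    """Weighted mean of reviewer ratings per evaluatee.

    Reviewers without a weight (i.e. those that failed the exam) are ignored.
    Returns the evaluatees ranked best first, plus their final scores.
    """
    totals = defaultdict(float)
    norm = defaultdict(float)
    for reviewer, scores in ratings.items():
        w = weights.get(reviewer)
        if w is None:
            continue  # reviewer did not pass the qualification exam
        for evaluatee, score in scores.items():
            totals[evaluatee] += w * score
            norm[evaluatee] += w
    final = {e: totals[e] / norm[e] for e in totals}
    ranking = sorted(final, key=final.get, reverse=True)
    return ranking, final


# Hypothetical exam results: fraction of exam items on which each candidate
# reviewer agreed with the human annotation.
exam_agreement = {"reviewer_a": 0.8, "reviewer_b": 0.7, "reviewer_c": 0.4}

# Hypothetical ratings (higher is better) given by each candidate reviewer
# to each evaluatee's submissions.
ratings = {
    "reviewer_a": {"model_x": 4.0, "model_y": 3.0, "model_z": 2.0},
    "reviewer_b": {"model_x": 3.0, "model_y": 4.0, "model_z": 2.0},
    "reviewer_c": {"model_x": 1.0, "model_y": 1.0, "model_z": 5.0},
}

weights = qualify(exam_agreement)            # reviewer_c is filtered out
ranking, final = aggregate(ratings, weights)
print(ranking)
```

In this toy data, the unqualified reviewer strongly favors one evaluatee, so filtering it out before aggregation changes which evaluatee would otherwise be pulled upward. The sketch covers only the rating mode; the pairwise-comparison mode mentioned in the abstract would need a different aggregation step.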
