Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

Junjie Chen, Weihang Su, Zhumin Chu, Haitao Li, Yujia Zhou, Dingbo Yuan, Xudong Wang, Jun Zhou, Yiqun Liu, Min Zhang, Shaoping Ma, Qingyao Ai

TL;DR

Auto-PRE addresses the challenge of evaluating rapidly evolving LLMs by introducing an automatic qualification exam that selects qualified evaluator LLMs. It defines three evaluator traits (Consistency, Pertinence, and Self-Confidence) and implements a corresponding automatic selection method for each, none of which requires human annotations, enabling cost-efficient and scalable LLM-based evaluation. On three open-ended tasks (XSum, NF_CATS, DailyDialog), Auto-PRE achieves state-of-the-art performance while substantially reducing evaluation costs and mitigating the biases associated with single-series evaluators. The demonstrated synergy among the three selection methods, together with the detailed appendix, provides a practical foundation for using LLMs as judges in robust, scalable evaluation pipelines.

Abstract

The rapid development of large language models (LLMs) has highlighted the need for efficient and reliable methods to evaluate their performance. Traditional evaluation methods often face challenges like high costs, limited task formats, dependence on human references, and systematic biases. To address these limitations, we propose Auto-PRE, an automatic LLM evaluation framework inspired by the peer review process. Unlike previous approaches that rely on human annotations, Auto-PRE automatically selects evaluator LLMs based on three core traits: consistency, pertinence, and self-confidence, which correspond to the instruction, content, and response stages, respectively, and collectively cover the entire evaluation process. Experiments on three representative tasks, including summarization, non-factoid QA, and dialogue generation, demonstrate that Auto-PRE achieves state-of-the-art performance while significantly reducing evaluation costs. Furthermore, the structured and scalable design of our automatic qualification exam framework provides valuable insights into automating the evaluation of LLMs-as-judges, paving the way for more advanced LLM-based evaluation frameworks.

Paper Structure

This paper contains 25 sections, 5 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: Comparison of existing collaborative evaluation methods. Our Auto-PRE offers advantages in reducing bias and lowering cost.
  • Figure 2: The framework of our automatic qualification exam. (1) Consistency measures the proportion of consistent outputs by the LLM after swapping answer positions in prompts; (2) Pertinence assesses whether the LLM evaluates based on the pertinence of answers to the question, unaffected by their superficial quality; (3) Self-Confidence determines if the LLM exhibits higher confidence on easier question sets when facing two sets of the same format but objectively different difficulties.
  • Figure 3: Performance on XSum (pairwise).
  • Figure 4: The left vertical axis represents LLM uncertainty, and the right vertical axis shows LLM accuracy in manual annotation-based qualification exams. The horizontal axis displays the experimental groups, each split into an easy and a hard set (e.g., glm3_e_p2 denotes ChatGLM3-6B with prompt2 on the easy set). Accuracy is marked by red triangles, while uncertainty is shown as box plots (Williamson et al., 1989). A paired box with a higher median (blue line) on the left than on the right indicates unreasonable confidence levels.
  • Figure 5: Performance on NF_CATS (pairwise).
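
The Figure 2 caption describes Consistency as the proportion of consistent outputs after swapping answer positions, and Self-Confidence as showing higher confidence on the easier of two question sets. The sketch below illustrates those two checks only; it is not the authors' implementation. The `Judge` signature, the "A"/"B" labels, the function names, and the use of median uncertainty as the comparison statistic (borrowed loosely from the Figure 4 description) are all assumptions made for illustration. Pertinence is left out because the caption does not say how the contrasting answers are constructed.

```python
from statistics import median
from typing import Callable, Sequence, Tuple

# Hypothetical judge interface (an assumption, not from the paper):
# takes (question, answer_in_slot_A, answer_in_slot_B) and returns "A" or "B".
Judge = Callable[[str, str, str], str]


def consistency_rate(judge: Judge, items: Sequence[Tuple[str, str, str]]) -> float:
    """Fraction of items whose preferred answer survives a position swap."""
    if not items:
        raise ValueError("items must be non-empty")
    consistent = 0
    for question, first, second in items:
        original = judge(question, first, second)
        swapped = judge(question, second, first)
        # If the judge prefers the same underlying answer both times,
        # the returned slot label must flip when the slots are swapped.
        if (original, swapped) in {("A", "B"), ("B", "A")}:
            consistent += 1
    return consistent / len(items)


def confidence_is_reasonable(
    easy_uncertainty: Sequence[float], hard_uncertainty: Sequence[float]
) -> bool:
    """True when median uncertainty on the easy set is below that on the hard set.

    Assumption: lower uncertainty stands in for higher confidence, and the
    median is used as the summary statistic.
    """
    return median(easy_uncertainty) < median(hard_uncertainty)


if __name__ == "__main__":
    toy_items = [
        ("q1", "short", "a much longer answer"),
        ("q2", "long answer here", "x"),
    ]
    always_first = lambda q, a, b: "A"  # purely position-biased toy judge
    prefers_longer = lambda q, a, b: "A" if len(a) > len(b) else "B"
    print(consistency_rate(always_first, toy_items))
    print(consistency_rate(prefers_longer, toy_items))
    print(confidence_is_reasonable([0.1, 0.2, 0.3], [0.4, 0.5, 0.6]))
```

A judge that always picks the first slot scores zero under this check, while one whose preference follows the content of the answers scores one; a selection threshold between those extremes would be a further design choice not covered here.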