BatchEval: Towards Human-like Text Evaluation

Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Boyuan Pan, Heda Wang, Kan Li

TL;DR

BatchEval tackles the misalignment between human evaluation and traditional LLM-based sample-wise evaluation by introducing batch-wise evaluation with iterative heterogeneous batches. It shows that a two-stage procedure, a heterogeneous batch composition strategy, and decimal scoring yield the best performance, balancing robustness, discrimination, and cost. Across three LLMs and four text-evaluation tasks, BatchEval yields about 10.5% gains in Pearson and 7.1% gains in Spearman correlations with humans at roughly 64% of baseline API cost. Attentional analyses and robustness experiments suggest that in-batch references enable both more accurate single predictions and richer ensemble behavior, advancing human-like text evaluation in a scalable way.
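The Pearson and Spearman figures above measure how closely an evaluator's scores track human ratings. As a rough illustration of the two metrics only (the data below are made up and are not results from the paper), they can be computed as follows; the helper names `pearson`, `average_ranks`, and `spearman` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Illustrative data only (not from the paper).
human_scores = [1, 2, 3, 4, 5]
model_scores = [1.0, 2.5, 2.0, 4.5, 10.0]
print(pearson(human_scores, model_scores))
print(spearman(human_scores, model_scores))
```

Pearson rewards scores that are linearly related to the human ratings, while Spearman only checks whether the ranking is preserved, which is why the two are usually reported together.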

Abstract

Significant progress has been made in automatic text evaluation with the introduction of large language models (LLMs) as evaluators. However, the current sample-wise evaluation paradigm suffers from the following issues: (1) sensitivity to prompt design; (2) poor resistance to noise; (3) inferior ensemble performance with a static reference. Inspired by the fact that humans treat both criterion definition and inter-sample comparison as references for evaluation, we propose BatchEval, a paradigm that conducts batch-wise evaluation iteratively to alleviate the above problems. We explore variants under this paradigm and confirm that the optimal settings are a two-stage procedure with a heterogeneous batch composition strategy and a decimal scoring format. Comprehensive experiments across 3 LLMs on 4 text evaluation tasks demonstrate that BatchEval outperforms state-of-the-art methods by 10.5% on Pearson correlation with only 64% of the API cost on average. Further analyses verify the robustness, generalization, and working mechanism of BatchEval.
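The abstract names the ingredients (iterative batch-wise scoring, heterogeneous batches, decimal scores) without giving the procedure, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes that "heterogeneous" means each batch mixes samples whose current scores differ, that scores are averaged across rounds, and that an LLM call can be stood in for by a hypothetical `toy_score_batch` function. The two-stage procedure is not modeled.

```python
import random
from statistics import mean

def heterogeneous_batches(ids, scores, batch_size):
    """Assumed composition rule: sort by current score, then deal samples
    round-robin so every batch spans the score range."""
    ordered = sorted(ids, key=lambda i: scores[i])
    n_batches = -(-len(ordered) // batch_size)  # ceiling division
    batches = [[] for _ in range(n_batches)]
    for pos, i in enumerate(ordered):
        batches[pos % n_batches].append(i)
    return batches

def batch_eval(samples, score_batch, batch_size=4, rounds=3, seed=0):
    """Score every sample once per round, inside a batch, and average."""
    rng = random.Random(seed)
    ids = list(range(len(samples)))
    history = [[] for _ in ids]
    # First round: no scores exist yet, so batches are random.
    shuffled = ids[:]
    rng.shuffle(shuffled)
    batches = [shuffled[k:k + batch_size]
               for k in range(0, len(shuffled), batch_size)]
    for _ in range(rounds):
        for batch in batches:
            scores = score_batch([samples[i] for i in batch])
            for i, s in zip(batch, scores):
                history[i].append(float(s))
        # Later rounds: regroup using the running mean scores.
        current = [mean(h) for h in history]
        batches = heterogeneous_batches(ids, current, batch_size)
    return [mean(h) for h in history]

def toy_score_batch(texts):
    """Hypothetical stand-in for an LLM call: decimal scores in [1, 5],
    relative to the longest text in the same batch."""
    longest = max(len(t) for t in texts)
    return [round(1 + 4 * len(t) / longest, 1) for t in texts]

samples = ["ok", "a fair reply", "a longer, more detailed reply",
           "short", "medium length text", "tiny", "quite a long answer here"]
final_scores = batch_eval(samples, toy_score_batch, batch_size=3, rounds=3)
print(final_scores)
```

Because the toy scorer judges each text relative to its batch, a sample's score changes as its batch-mates change between rounds; this is the kind of in-batch reference effect that the iterative regrouping is meant to exploit.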

Paper Structure

This paper contains 67 sections, 2 theorems, 8 equations, 8 figures, 6 tables, and 1 algorithm.

Key Result

Theorem 1

The robustness against noise correlates positively with the uniformity of the evaluator's scoring distribution. (See the appendix for the detailed derivation.)

Figures (8)

  • Figure 1: Human evaluators judge text quality based on both criterion definition and sample comparison, while current LLM-based evaluators rely only on the criterion.
  • Figure 2: Overall illustration of BatchEval.
  • Figure 3: Score distribution and corresponding entropy ($-\sum_{s} p(s) \log_2 p(s)$) of different methods.
  • Figure 4: Comparisons between BatchEval and CloserLook from the perspective of Theorem 2.
  • Figure 5: Average batch bias of different strategies.
  • ...and 3 more figures
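
The entropy in the Figure 3 caption, $-\sum_{s} p(s) \log_2 p(s)$, can be read as a measure of how uniform a score distribution is, which is the quantity Theorem 1 relates to noise robustness. A minimal sketch of the computation follows, assuming scores are treated as discrete values (decimal scores may need binning first); the function name `score_entropy` is hypothetical.

```python
import math
from collections import Counter

def score_entropy(scores):
    """Shannon entropy in bits of the empirical score distribution p(s)."""
    counts = Counter(scores)
    n = len(scores)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Illustrative score lists only (not data from the paper).
uniform = [1, 2, 3, 4]   # every score used equally often: 2 bits
skewed = [3, 3, 3, 4]    # most samples share one score: lower entropy
print(score_entropy(uniform))
print(score_entropy(skewed))
```

An evaluator that assigns nearly every sample the same score has entropy close to zero, while one that spreads scores evenly over $k$ values reaches the maximum of $\log_2 k$ bits.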

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Proof 1