Benchmarking LLMs' Judgments with No Gold Standard

Shengwei Xu, Yuxuan Lu, Grant Schoenebeck, Yuqing Kong

TL;DR

The paper presents GEM, a no-gold-standard evaluator for LLM judgments based on Shannon mutual information, estimated via a conditional PMI computed with an evaluation-LM. GEM-S further conditions on a task synopsis to emphasize semantic information beyond superficial content, and preprocessing is used to filter shortcuts. The authors demonstrate that GEM and GEM-S correlate with human judgments, are sensitive to semantic degradations, and resist manipulation better than baselines, including GPT-4o Examiner. They also introduce GRE-bench, a yearly peer-review benchmark built on GEM/GEM-S with open-access data to mitigate data leakage and enable scalable evaluation of LLMs' peer-review capabilities.

Abstract

We introduce GEM (Generative Estimator for Mutual Information), an evaluation metric for assessing language generation by Large Language Models (LLMs), particularly in generating informative judgments, without the need for a gold standard reference. GEM broadens the scenarios where we can benchmark LLM generation performance, from traditional ones, like machine translation and summarization, where gold standard references are readily available, to subjective tasks without clear gold standards, such as academic peer review. GEM uses a generative model to estimate mutual information between candidate and reference responses, without requiring the reference to be a gold standard. In experiments on a human-annotated dataset, GEM demonstrates competitive correlations with human scores compared to the state-of-the-art GPT-4o Examiner, and outperforms all other baselines. Additionally, GEM is more robust against strategic manipulations, such as rephrasing or elongation, which can artificially inflate scores under a GPT-4o Examiner. We also present GRE-bench (Generating Review Evaluation Benchmark), which evaluates LLMs based on how well they can generate high-quality peer reviews for academic research papers. Because GRE-bench is based upon GEM, it inherits its robustness properties. Additionally, GRE-bench circumvents data contamination problems (or data leakage) by using the continuous influx of new open-access research papers and peer reviews each year. We show GRE-bench results of various popular LLMs on their peer review capabilities using the ICLR 2023 dataset.
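The idea of scoring a candidate by how much it raises an evaluation LM's probability of a reference response can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_log_prob` is a hypothetical stand-in for an evaluation LM (a crude token-overlap model chosen so the example runs without any model), and the names `pmi` and `gem_like_score` are assumptions introduced here. The synopsis argument mirrors the GEM-S idea of conditioning on a task synopsis so that information already contained in it earns no credit.

```python
import math
from typing import Callable, Sequence, Tuple

# Hypothetical interface: log P(text | context) under some evaluation LM.
LogProbFn = Callable[[str, str], float]


def toy_log_prob(text: str, context: str) -> float:
    """Toy stand-in for an evaluation LM (an assumption, not a real model).

    Each token of `text` gets probability 0.5 if it appears in `context`
    and 0.1 otherwise; the log-probabilities are summed.
    """
    context_tokens = set(context.lower().split())
    return sum(
        math.log(0.5 if token in context_tokens else 0.1)
        for token in text.lower().split()
    )


def pmi(log_prob: LogProbFn, candidate: str, reference: str,
        synopsis: str = "") -> float:
    """Pointwise mutual information estimate between candidate and reference.

    Computes log P(reference | synopsis, candidate) - log P(reference | synopsis).
    With an empty synopsis this reduces to the unconditioned variant.
    """
    conditioned = log_prob(reference, f"{synopsis} {candidate}".strip())
    baseline = log_prob(reference, synopsis)
    return conditioned - baseline


def gem_like_score(log_prob: LogProbFn,
                   items: Sequence[Tuple[str, str, str]]) -> float:
    """Average the PMI estimate over (candidate, reference, synopsis) triples."""
    if not items:
        raise ValueError("need at least one (candidate, reference, synopsis) triple")
    return sum(pmi(log_prob, c, r, s) for c, r, s in items) / len(items)
```

For instance, `pmi(toy_log_prob, "the proof is incomplete", "proof incomplete in section two")` is positive because the candidate makes two reference tokens more likely, while a candidate that only repeats the synopsis scores zero once the synopsis is conditioned on. In practice the toy function would be replaced by token log-probabilities from a real language model.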

Paper Structure

This paper contains 67 sections, 3 theorems, 19 equations, 6 figures, 15 tables, 2 algorithms.

Key Result

Proposition 3.0

When the KL-divergence between the LLM-estimated distribution and the underlying distribution satisfies … For the two candidates $H$ and $L$ discussed above, the information structure of $H$ Blackwell dominates $L$'s, when the size of the dataset $n$ goes to …

Footnote: The KL-divergence between two distributions over the same probability space is $D_{\text{KL}}(P \| Q) = \sum_{x} P(x) \log \left(P(x)/Q(x)\right)$.
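The KL-divergence definition quoted in the proposition can be checked numerically with a short sketch. The function name `kl_divergence` and the example distributions are assumptions chosen for illustration; the handling of zero-probability outcomes follows the usual convention that terms with $P(x) = 0$ contribute nothing and that the divergence is infinite when $Q(x) = 0$ but $P(x) > 0$.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)), in nats.

    `p` and `q` are probability vectors over the same finite outcome space.
    """
    if len(p) != len(q):
        raise ValueError("p and q must cover the same outcome space")
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # 0 * log(0 / q) is taken to be 0
        if qx == 0.0:
            return math.inf  # P puts mass where Q puts none
        total += px * math.log(px / qx)
    return total


# Illustrative values only: an "underlying" distribution and an imperfect estimate.
underlying = [0.5, 0.5]
estimated = [0.25, 0.75]
divergence = kl_divergence(underlying, estimated)  # equals 0.5 * log(4/3)
```

The divergence is zero exactly when the two distributions coincide, which is why a bound on it serves as a measure of how well the evaluation LM approximates the underlying distribution.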

Figures (6)

  • Figure 1: Our model
  • Figure 2: Example of Hierarchical Information Structure in Peer Reviews
  • Figure 3: An Overview of Our Generative Estimator for Mutual Information
  • Figure 4: Meaningless Elongation
  • Figure 5: Results of GRE-bench based on three evaluation metrics with 90% confidence intervals vs. model parameter sizes indicated by color. The grey line represents the average human baseline, with the 90% confidence interval shaded in grey.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Proposition 3.0
  • Proposition B.0
  • proof: Proof of Proposition \ref{prop:GEM}
  • Proposition B.0
  • proof