Poor-Supervised Evaluation for SuperLLM via Mutual Consistency

Peiwen Yuan; Shaoxiong Feng; Yiwei Li; Xinglin Wang; Boyuan Pan; Heda Wang; Yao Hu; Kan Li

TL;DR

PoEM shifts evaluation from a human-centric paradigm to a human-and-model-centric one by treating both humans and models as reference models, mitigating the limitations of human evaluation in the era of LLMs.

Abstract

Guidance from capability evaluations has greatly propelled the progress of both human society and Artificial Intelligence. However, as LLMs evolve, it becomes challenging to construct evaluation benchmarks with accurate labels for hard tasks that approach the boundaries of human capabilities. To conduct credible evaluation without accurate labels (denoted as poor-supervised evaluation), we propose the PoEM framework. We first prove that the capability of a model can be equivalently assessed by its consistency with a certain reference model, when their prediction distributions are independent and the sample size is infinite. Because these conditions are not fully met in practice, we further introduce an algorithm that treats humans (when available) and the models under evaluation as reference models, alternating between model weight calibration and reference model filtering across the E-step and M-step. Comprehensive experiments across 3 types of tasks with 16 mainstream LLMs show that PoEM under poor supervision achieves an average Pearson correlation coefficient of 0.98 with supervised evaluation results, demonstrating good effectiveness, efficiency and generalizability. More generally, PoEM shifts evaluation from a human-centric paradigm to a human-and-model-centric one by treating both humans and models as reference models, mitigating the limitations of human evaluation in the era of LLMs.
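The core claim of the abstract, that consistency with an independent reference model can stand in for accuracy, can be illustrated with a small simulation. This is a minimal sketch and not the PoEM algorithm itself: the function names, the accuracy values, the 4-class setting, and the assumption that wrong answers are spread uniformly over the incorrect labels are all hypothetical choices made here for illustration.

```python
import random


def simulate_predictions(labels, accuracy, num_classes, rng):
    """Draw predictions that match the true label with probability `accuracy`.

    Assumption (ours): wrong answers are uniform over the other classes and
    are drawn independently of every other simulated model.
    """
    preds = []
    for y in labels:
        if rng.random() < accuracy:
            preds.append(y)
        else:
            wrong = [c for c in range(num_classes) if c != y]
            preds.append(rng.choice(wrong))
    return preds


def consistency(preds_a, preds_b):
    """Fraction of samples on which two prediction lists agree."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def run_demo(seed=0, num_samples=20000, num_classes=4):
    rng = random.Random(seed)
    labels = [rng.randrange(num_classes) for _ in range(num_samples)]

    # Hypothetical reference model: better than random guessing (0.25 here).
    reference = simulate_predictions(labels, 0.60, num_classes, rng)

    rows = []
    for acc in [0.35, 0.50, 0.65, 0.80]:  # hypothetical models under evaluation
        preds = simulate_predictions(labels, acc, num_classes, rng)
        measured_accuracy = consistency(preds, labels)       # needs true labels
        ref_consistency = consistency(preds, reference)      # needs no labels
        rows.append((measured_accuracy, ref_consistency))
    return rows


rows = run_demo()
for measured_accuracy, ref_consistency in rows:
    print(f"accuracy={measured_accuracy:.3f}  consistency_with_reference={ref_consistency:.3f}")
```

Under these assumptions the two columns should rank the simulated models in the same order, even though the second column never looks at the true labels. With a small sample or a reference model whose errors correlate with the evaluated models, the ranking can break, which is the gap the calibration and filtering steps described above are meant to address.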

Paper Structure (41 sections, 1 theorem, 21 equations, 5 figures, 7 tables)

Introduction
Related Work
Poor-supervised Evaluation
Model Consistency
PoEM Framework
Task Definition
Equating Mutual Consistency with Capability
Results and Insights of Preliminary Experiments
Algorithms for Mitigating Insufficient Conditions
Naive Ensemble
Weight Calibration
Reference Model Filtering
EM-based Integration
Initialization.
Optimizing Objective.
...and 26 more sections

Key Result

Theorem 1

When the size of $X$ is infinite (Condition 1), if a certain reference model $\dot{\mathcal{M}}$ performs better than random guessing (Condition 2) and its predictions are independent of the models $\{\mathcal{M}^i\}_{i=1}^L$ under evaluation (Condition 3), the following equation holds (see Appendix §theo), where $Cons(\cdot)$ denotes mutual consistency (see §exp for the calculation methods for different tasks).
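A simplified worked case may help show why Conditions 2 and 3 matter. It is an illustration under assumptions introduced here, not the paper's proof: consider a $K$-way classification task, a model $\mathcal{M}$ with accuracy $p$, a reference model $\dot{\mathcal{M}}$ with accuracy $q$, errors that are independent given the true label, and wrong answers spread uniformly over the $K-1$ incorrect labels. The symbols $p$, $q$ and $K$ are not taken from the source.

```latex
\begin{align*}
\mathbb{E}\big[Cons(\mathcal{M}, \dot{\mathcal{M}})\big]
  &= \underbrace{p\,q}_{\text{both correct}}
   + \underbrace{(1-p)(1-q)\,\frac{1}{K-1}}_{\text{both wrong, same label}} \\[4pt]
\frac{\partial}{\partial p}\,\mathbb{E}\big[Cons(\mathcal{M}, \dot{\mathcal{M}})\big]
  &= q - \frac{1-q}{K-1}
   = \frac{Kq - 1}{K-1}
   > 0 \iff q > \frac{1}{K}
\end{align*}
```

In this setting the expected consistency is strictly increasing in $p$ exactly when the reference model beats random guessing ($q > 1/K$), so ranking models by consistency with $\dot{\mathcal{M}}$ matches ranking them by accuracy. The infinite sample size of Condition 1 is what removes the sampling noise around these expectations.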

Figures (5)

Figure 1: Schematic diagram of performance of humans and models with varying task difficulty. It is becoming increasingly difficult for humans to accurately evaluate LLMs (offer accurate supervision) with their rapid development.
Figure 2: Mutual consistency and affinity matrices on MATH-Precalculus dataset among LLMs.
Figure 3: Overall illustration of our proposed algorithms.
Figure 4: Comparisons between $r_s$ of PoEM, the mean of the absolute values of affinity matrix $Avg(Abs(A))$, and average accuracy of all the models $Avg(B)$. We plot the $20^{th}$ power of $r_s$ and $0.5-Avg(Abs(A))$ for easier observation.
Figure 5: $r_p$ and $r_s$ with error bar of PoEM as sample size changes.

Theorems & Definitions (3)

Theorem 1
Definition 1
proof
