Dynamic Evaluation of Large Language Models by Meta Probing Agents

Kaijie Zhu; Jindong Wang; Qinlin Zhao; Ruochen Xu; Xing Xie

TL;DR

The paper introduces Meta Probing Agents (MPA), a psychometrics-inspired dynamic evaluation protocol in which probing and judge agents transform and validate evaluation samples along three abilities: language understanding, problem solving, and domain knowledge. Experiments on MMLU, ARC-C, GSM8K, and BBH reveal systematic performance degradation attributed to data contamination, strong correlations among the three abilities, and a Matthew effect in which larger models exhibit stronger correlations. The framework also shows potential as a data augmentation approach: some models improve when trained on MPA-generated samples. Overall, MPA offers a flexible, interpretable path toward fine-grained LLM evaluation and targeted improvement. The paper also notes limitations and future work in broader task coverage and the robustness of the judge mechanism.

Abstract

Evaluation of large language models (LLMs) has raised great concerns in the community due to the issue of data contamination. Existing work designs evaluation protocols using well-defined algorithms for specific tasks, which cannot be easily extended to diverse scenarios. Moreover, current evaluation benchmarks provide only overall results and cannot support a fine-grained, multifaceted analysis of LLMs' abilities. In this paper, we propose meta probing agents (MPA), a general dynamic evaluation protocol inspired by psychometrics to evaluate LLMs. MPA is the key component of DyVal 2, which naturally extends the previous DyVal (Zhu et al., 2023). MPA uses probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory on three basic cognitive abilities: language understanding, problem solving, and domain knowledge. These basic abilities are also dynamically configurable, allowing multifaceted analysis. We conducted extensive evaluations using MPA and found that most LLMs perform worse on the probing benchmarks than on the original ones, indicating room for improvement. Our multifaceted analysis demonstrated a strong correlation between the basic abilities and an implicit Matthew effect on model size, i.e., larger models possess stronger correlations among the abilities. MPA can also be used as a data augmentation approach to enhance LLMs. Code is available at: https://github.com/microsoft/promptbench.
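The probe-then-judge idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or the promptbench API: the function `probe_sample`, the prompt wording, the retry limit, and the toy agents are all hypothetical, and each agent is modeled simply as a function from a prompt string to a response string.

```python
from typing import Callable, Optional

# Assumption: an "agent" is any callable mapping a prompt to a text response.
Agent = Callable[[str], str]


def probe_sample(
    question: str,
    principle: str,
    probing_agent: Agent,
    judge_agent: Agent,
    max_retries: int = 3,  # hypothetical retry budget, not taken from the paper
) -> Optional[str]:
    """Return a transformed question accepted by the judge, or None."""
    for _ in range(max_retries):
        # Probing step: ask for a rewrite under the given principle.
        candidate = probing_agent(
            f"Rewrite the question following the principle '{principle}' "
            f"without changing its answer.\nQuestion: {question}"
        )
        # Judging step: check that the rewrite preserves the original answer.
        verdict = judge_agent(
            f"Original: {question}\nRewritten: {candidate}\n"
            "Does the rewritten question keep the same answer? Reply yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            return candidate
    return None  # every candidate was rejected


# Toy stand-ins for LLM calls, for illustration only.
def toy_probe(prompt: str) -> str:
    return "paraphrased: " + prompt.rsplit("Question: ", 1)[-1]


def toy_accepting_judge(prompt: str) -> str:
    return "Yes"


def toy_rejecting_judge(prompt: str) -> str:
    return "No"


accepted = probe_sample("What is 2+2?", "paraphrase", toy_probe, toy_accepting_judge)
rejected = probe_sample("What is 2+2?", "paraphrase", toy_probe, toy_rejecting_judge)
```

In this sketch a sample is kept only when the judge accepts it, so a rejected transformation never reaches the probing benchmark. In practice the two toy functions would be replaced by calls to actual LLMs.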

Paper Structure (41 sections, 9 figures, 8 tables)

Introduction
Related Work
Method
Overview
Probing Agent
Judge Agent
Human Verification
Psychometric principles
Language Understanding
Problem Solving
Domain Knowledge
Experiments
Experimental Setup
Main Results
Effect of Different Probing Principles
...and 26 more sections

Figures (9)

Figure 1: Performance of different LLMs on the vanilla MMLU test set and on our probing benchmarks based on MMLU. LU, PS, and DK denote the probing sets that evaluate language understanding, problem solving, and domain knowledge, respectively.
Figure 2: Inspired by psychometric theory on the three basic cognitive abilities, our Meta Probing Agent (MPA) designs corresponding principles that transform original benchmarks into new ones. These principles can be flexibly combined to create various probing benchmarks for multifaceted analysis. Subfigure (c) shows how MPA generates a new sample from an existing ARC-C sample.
Figure 3: The confusion matrix of original benchmarks and probing benchmarks on the ARC-C dataset.
Figure 4: The relative effectiveness of different principles on the MMLU and ARC-C datasets.
Figure 5: The accuracy of different LLMs on ARC-C and MMLU at different levels of probing benchmarks. LU, LU+PS, and LU+PS+DK denote probing benchmarks that apply language understanding principles only, language understanding and problem solving principles, and all principles, respectively.
...and 4 more figures
