PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

Jiatong Li; Renjun Hu; Kunzhe Huang; Yan Zhuang; Qi Liu; Mengxiao Zhu; Xing Shi; Wei Lin

TL;DR

PertEval is a toolkit for in-depth probing of LLMs' knowledge capacity through knowledge-invariant perturbations. Its response pattern analysis shows that the perturbations preserve LLMs' uncertainty about specious knowledge and expose potential rote memorization of correct options, which leads to overestimated performance.

Abstract

Expert-designed close-ended benchmarks are indispensable in assessing the knowledge capacity of large language models (LLMs). Despite their widespread use, concerns have mounted regarding their reliability due to limited test scenarios and an unavoidable risk of data contamination. To address this, we present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge capacity through knowledge-invariant perturbations. These perturbations employ human-like restatement techniques to generate on-the-fly test samples from static benchmarks, meticulously retaining knowledge-critical content while altering irrelevant details. Our toolkit further includes a suite of response consistency analyses that compare performance on raw vs. perturbed test sets to precisely assess LLMs' genuine knowledge capacity. Six representative LLMs are re-evaluated using PertEval. Results reveal significantly inflated performance of the LLMs on raw benchmarks, including an absolute 25.8% overestimation for GPT-4. Additionally, through a nuanced response pattern analysis, we discover that PertEval retains LLMs' uncertainty about specious knowledge and reveals their potential rote memorization of correct options, which leads to overestimated performance. We also find that the detailed response consistency analyses by PertEval can illuminate various weaknesses in existing LLMs' knowledge mastery and guide their refinement. Our findings provide insights for advancing more robust and genuinely knowledgeable LLMs. Our code is available at https://github.com/aigc-apps/PertEval.

Paper Structure (28 sections, 1 theorem, 4 equations, 15 figures, 14 tables, 1 algorithm)

Introduction
Related Work
Methodology
Knowledge-invariant Perturbations
Knowledge Invariance Verification
Response Consistency Analyses for Measuring Real Knowledge Capacity
Experiments
Knowledge Invariance Verification for Perturbations
Real Knowledge Capacity Evaluation
Response Pattern Analysis
Overall Performance Stability
Correct Response Consistency
Discussion
Related Work
Benchmarks for Knowledge Capacity Evaluation
...and 13 more sections

Key Result

Proposition D.1

In multiple choice questions, given $k$ options and one single correct answer for each question, the expected value of ACC@Consist is $1/k^2$ for pure guessing.
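The value $1/k^2$ can be checked by direct enumeration. The sketch below rests on two assumptions that this page does not spell out: that ACC@Consist credits a question only when the model answers correctly on both the original and the perturbed version, and that pure guessing means two independent, uniform picks among the $k$ options. Under those assumptions each pick is correct with probability $1/k$, so both are correct with probability $1/k \cdot 1/k$. The function name `expected_acc_consist_guessing` is hypothetical and not part of the PertEval code.

```python
from fractions import Fraction
from itertools import product


def expected_acc_consist_guessing(k: int) -> Fraction:
    """Exact expected ACC@Consist for one k-option question under pure guessing.

    Assumption: a question counts only if the guess on the original version
    AND the guess on the perturbed version are both correct.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    correct = 0  # by symmetry, which option is correct does not matter
    options = range(k)
    # Enumerate every equally likely (original guess, perturbed guess) pair.
    hits = sum(
        1
        for guess_original, guess_perturbed in product(options, repeat=2)
        if guess_original == correct and guess_perturbed == correct
    )
    return Fraction(hits, k * k)


if __name__ == "__main__":
    for k in (2, 4, 5):
        print(k, expected_acc_consist_guessing(k))
```

If a perturbation moves the correct option to a different position, the same count applies by symmetry: exactly one of the $k^2$ guess pairs is correct on both versions.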

Figures (15)

Figure 1: An overview of the PertEval evaluation toolkit. PertEval uses content-level and format-level perturbations to generate a perturbed dataset $D'$ from an existing close-ended benchmark dataset $D$. Next, it evaluates the knowledge capacity of LLMs via response consistency analysis. PertEval also characterizes the performance of LLMs in depth via response pattern analysis.
Figure 2: (left) Perturbation-wise knowledge invariance scores$\uparrow$ by gpt-4-turbo (systematic sampling, interval = 10); (right) edit distances between original question and perturbed questions.
Figure 3: Real knowledge capacities measured by ACC@Consist$\uparrow$ with composite knowledge-invariant perturbation. ACC@Original and ACC@Perturb denote accuracy on the original and perturbed data, respectively. We report the macro-averaged results on all tested datasets.
Figure 4: Response patterns of gpt-4-turbo. Left: original data; Right: perturbed data.
Figure 5: Prompt template for the rewriter LLM. The expected similarity score, set to 0.6 in all experiments, controls how strongly the rewriter LLM paraphrases.
...and 10 more figures
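The three accuracies named in the Figure 3 caption, and their macro average over datasets, can be sketched as follows. This is a minimal illustration rather than the PertEval implementation: the record layout, the function names `accuracies` and `macro_average`, and the reading of ACC@Consist as "correct on both the original and the perturbed version" are assumptions made here.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout:
# (gold on original, prediction on original, gold on perturbed, prediction on perturbed)
Record = Tuple[str, str, str, str]


def accuracies(records: List[Record]) -> Dict[str, float]:
    """Compute ACC@Original, ACC@Perturb and ACC@Consist for one dataset."""
    if not records:
        raise ValueError("records must not be empty")
    n = len(records)
    original = sum(pred_o == gold_o for gold_o, pred_o, _, _ in records)
    perturbed = sum(pred_p == gold_p for _, _, gold_p, pred_p in records)
    # Assumed definition: credit only when both versions are answered correctly.
    consistent = sum(
        pred_o == gold_o and pred_p == gold_p
        for gold_o, pred_o, gold_p, pred_p in records
    )
    return {
        "ACC@Original": original / n,
        "ACC@Perturb": perturbed / n,
        "ACC@Consist": consistent / n,
    }


def macro_average(per_dataset: Dict[str, List[Record]]) -> Dict[str, float]:
    """Average each metric over datasets, weighting every dataset equally."""
    if not per_dataset:
        raise ValueError("per_dataset must not be empty")
    scores = [accuracies(records) for records in per_dataset.values()]
    return {name: sum(s[name] for s in scores) / len(scores) for name in scores[0]}
```

Macro averaging weights each dataset equally regardless of its size, so a small dataset influences the reported score as much as a large one.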

Theorems & Definitions (2)

Proposition D.1
proof
