KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Wei Ye, Jindong Wang, Xing Xie, Yue Zhang, Shikun Zhang

TL;DR

KIEval introduces a knowledge-grounded interactive framework for evaluating LLMs, addressing data contamination by using a dynamic interactor-driven dialogue whose outcomes are scored by a separate evaluator. The approach demonstrates strong alignment with human judgments and reveals that contamination does not enhance genuine understanding, while existing detection methods struggle to catch fine-tuning contamination. Through extensive experiments across multiple models and datasets, KIEval shows improved differentiation of real capabilities beyond static benchmarks and MT-Bench, offering a scalable protocol with transparent metrics and reproducible prompts. The work suggests a shift toward evaluating reasoning and knowledge application in open-ended conversations to obtain more reliable assessments of real-world model performance.

Abstract

Automatic evaluation methods for large language models (LLMs) are hindered by data contamination, which inflates assessments of their effectiveness. Existing strategies aim to detect contaminated texts and therefore quantify contamination status rather than accurately gauge model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, which for the first time incorporates an LLM-powered "interactor" role to achieve dynamic, contamination-resilient evaluation. Starting with a question from a conventional LLM benchmark involving domain-specific knowledge, KIEval uses dynamically generated, multi-round, knowledge-focused dialogues to determine whether a model's response merely recalls benchmark answers or demonstrates deep enough comprehension to apply knowledge in more complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. We also reveal that data contamination contributes nothing, or is even detrimental, to models' real-world applicability and understanding, and that existing contamination detection methods for LLMs can identify contamination only in pre-training, not during supervised fine-tuning.
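
The interaction described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the callable interfaces for the interactor, candidate and evaluator, the Verdict fields, the round limit, and the plain-mean aggregation are all assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (interactor question, candidate answer)


@dataclass
class Verdict:
    """Hypothetical evaluator output for one round."""
    score: float        # assumed per-round score, e.g. in [0, 1]
    stop: bool = False  # evaluator requests early stopping
    reason: str = ""    # free-text reason for stopping


def run_dialogue(
    seed_question: str,
    interactor: Callable[[str, List[Turn]], str],
    candidate: Callable[[str, List[Turn]], str],
    evaluator: Callable[[List[Turn]], Verdict],
    max_rounds: int = 5,  # assumed limit, not taken from the paper
) -> Tuple[List[float], List[Turn]]:
    """Run a multi-round dialogue seeded by a benchmark question."""
    history: List[Turn] = []
    scores: List[float] = []
    question = seed_question
    for _ in range(max_rounds):
        # The candidate answers the current question given the dialogue so far.
        answer = candidate(question, history)
        history.append((question, answer))
        # A separate evaluator scores the dialogue after each round.
        verdict = evaluator(history)
        scores.append(verdict.score)
        if verdict.stop:
            break
        # The interactor generates a follow-up grounded in the seed question.
        question = interactor(seed_question, history)
    return scores, history


def mean_score(scores: List[float]) -> float:
    """Assumed aggregation: plain average of the per-round scores."""
    return sum(scores) / len(scores) if scores else 0.0
```

Because the follow-up questions are generated during the dialogue rather than drawn from a fixed set, a model that has only memorized the benchmark answer has nothing to recall after the first round, which is the intuition behind the contamination resilience claimed above.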

Paper Structure

This paper contains 25 sections, 1 equation, 5 figures, 19 tables, 1 algorithm.

Figures (5)

  • Figure 1: The pipeline of KIEval compared to previous static dataset-based and LLM-based evaluation methods.
  • Figure 2: Detailed evaluation results using KIEval, including the overall KIEval score and the KIEval scores for the aspects Accuracy, Logic, Relevance, Coherence and Conciseness. For comparison, we also provide dataset accuracies (5-shot). Due to page limits and the large volume of experimental data, the complete results are given in the appendix.
  • Figure 3: Statistics on reasons to trigger early stopping given by the evaluator model.
  • Figure 4: Scatter plots of KIEval scores and traditional benchmark scores by model and dataset. Each point represents the performance of a model on a specific dataset, measured by the KIEval score and accuracy score (5-shot). Regression lines are plotted for each dataset. Points significantly above the regression line indicate the performance gap not captured by traditional benchmarks but captured by KIEval, while points significantly below the regression line indicate potential data contamination in traditional benchmarks.
  • Figure 5: The full system prompt for interactor, candidate and evaluator models.
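
The reading of Figure 4 (points above or below a per-dataset regression line) can be illustrated with a small residual calculation. This is a sketch under stated assumptions: the ordinary least-squares fit, the function names, and the threshold are hypothetical choices made here, and the source does not specify how "significantly" above or below is determined.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs: List[float], ys: List[float]) -> List[float]:
    """Vertical distance of each point from the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def flag_points(xs: List[float], ys: List[float], threshold: float) -> List[str]:
    """Label each model relative to the regression line.

    xs: benchmark accuracy per model, ys: KIEval score per model.
    threshold is an assumed cut-off, not a value from the paper.
    """
    labels = []
    for r in residuals(xs, ys):
        if r > threshold:
            labels.append("above")  # capability the benchmark did not capture
        elif r < -threshold:
            labels.append("below")  # possible benchmark contamination
        else:
            labels.append("near")
    return labels
```

A point labeled "below" has a higher benchmark accuracy than its KIEval score would predict, which the caption interprets as a sign of potential data contamination.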