
Explain-Query-Test: Self-Evaluating LLMs Via Explanation and Comprehension Discrepancy

Saeid Asgari Taghanaki, Joao Monteiro

TL;DR

The paper introduces Explain-Query-Test (EQT), a self-evaluation framework that decouples explanation generation from question answering to probe true comprehension in LLMs. By generating explanations, deriving self-contained questions, and answering those questions, EQT measures internal knowledge consistency via full-loop accuracy and Answer Consistency Score (ACS). Empirical results show a meaningful, though imperfect, correlation between EQT performance and established benchmarks like MMLU-Pro, and reveal a notable gap between the quality of explanations and the ability to reason about them. The findings support EQT as a data-efficient proxy for ranking LLMs and highlight areas in need of improved internal knowledge representation and reasoning capabilities.
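One way to make "proxy for ranking LLMs" concrete is to check how closely a ranking by EQT accuracy agrees with a ranking by benchmark accuracy. The sketch below computes a Spearman rank correlation in plain Python. It is an illustration under stated assumptions, not the paper's analysis code: the function names are hypothetical, and the scores in the usage example are made-up placeholders, not results from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Placeholder scores for four hypothetical models (NOT taken from the paper).
eqt_accuracy = [0.61, 0.72, 0.55, 0.80]
benchmark_accuracy = [0.58, 0.70, 0.60, 0.77]
rho = spearman(eqt_accuracy, benchmark_accuracy)
```

A value of `rho` near 1 would mean the two evaluations order the models almost identically; a value well below 1 would correspond to the "imperfect" agreement the summary describes.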

Abstract

Large language models (LLMs) have demonstrated remarkable proficiency in generating detailed and coherent explanations of complex concepts. However, the extent to which these models truly comprehend the concepts they articulate remains unclear. To assess the level of comprehension of a model relative to the content it generates, we implemented a self-evaluation pipeline where models: (i) given a topic generate an excerpt with information about the topic, (ii) given an excerpt generate question-answer pairs, and finally (iii) given a question generate an answer. We refer to this self-evaluation approach as Explain-Query-Test (EQT). Interestingly, the accuracy on generated questions resulting from running the EQT pipeline correlates strongly with the model performance as verified by typical benchmarks such as MMLU-Pro. In other words, EQT's performance is predictive of MMLU-Pro's, and EQT can be used to rank models without the need for any external source of evaluation data other than lists of topics of interest. Moreover, our results reveal a disparity between the models' ability to produce detailed explanations and their performance on questions related to those explanations. This gap highlights fundamental limitations in the internal knowledge representation and reasoning abilities of current LLMs. We release the code at https://github.com/asgsaeid/EQT.
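The three steps in the abstract can be sketched as a small loop. This is a minimal illustration, not the released implementation: `run_eqt`, `QAPair`, and the three callables are hypothetical names, the model calls are replaced by toy stand-in functions, and exact-match scoring is an assumption made only to keep the example short. The one property taken directly from the abstract is that the answering step receives the question alone, without the excerpt it was derived from.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class QAPair:
    question: str
    answer: str  # reference answer produced in the "Query" step


def run_eqt(
    topics: Iterable[str],
    explain: Callable[[str], str],
    make_questions: Callable[[str], List[QAPair]],
    answer_question: Callable[[str], str],
) -> float:
    """Return accuracy on self-generated questions over all topics."""
    correct = 0
    total = 0
    for topic in topics:
        excerpt = explain(topic)                 # (i) Explain: topic -> excerpt
        for qa in make_questions(excerpt):       # (ii) Query: excerpt -> QA pairs
            # (iii) Test: the model sees only the question, not the excerpt.
            predicted = answer_question(qa.question)
            # Assumption: case-insensitive exact match as the scoring rule.
            if predicted.strip().lower() == qa.answer.strip().lower():
                correct += 1
            total += 1
    return correct / total if total else 0.0


# Toy stand-ins for the model calls (hypothetical, for illustration only).
def toy_explain(topic: str) -> str:
    return f"{topic} is studied in school."


def toy_make_questions(excerpt: str) -> List[QAPair]:
    subject = excerpt.split(" ")[0]
    return [
        QAPair("Which subject is studied in school?", subject),
        QAPair("Is the subject studied at sea?", "no"),
    ]


def toy_answer(question: str) -> str:
    return "no"  # a deliberately weak answerer


accuracy = run_eqt(["Algebra"], toy_explain, toy_make_questions, toy_answer)
```

In practice the three callables would wrap calls to the same model, so the resulting accuracy reflects how consistently that model can answer questions about content it wrote itself.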

Paper Structure

This paper contains 18 sections, 3 equations, 3 figures, 1 table, and 1 algorithm.

Figures (3)

  • Figure 1: Comparison of EQT Accuracy across corresponding MMLU-Pro Categories.
  • Figure 2: Accuracy (%) comparison across MMLU-Pro categories for various language models. Each model is represented by two bars: the first (solid) shows the original MMLU-Pro accuracy, and the second (hatched) shows the adjusted accuracy after EQT is applied, which adds new questions.
  • Figure 3: Analysis of MMLU-Pro and EQT results.