The Generative AI Paradox on Evaluation: What It Can Solve, It May Not Evaluate

Juhyun Oh, Eunsu Kim, Inha Cha, Alice Oh

TL;DR

The paper asks whether state-of-the-art generative LLMs can reliably evaluate free-form QA outputs. Using three LLMs and one open-source LM on TriviaQA, it compares each model's generation performance with its reference-free evaluation performance and finds a gap: models perform worse at evaluation than at generation and can produce unfaithful judgments. It extends the Generative AI Paradox (West et al., 2023) to evaluation, showing cases where a model generates the correct answer but mis-evaluates it, or generates an incorrect answer yet evaluates correctly, and it analyzes faithfulness, the limits of self-knowledge, and grading consistency. The findings highlight the need for careful design and assessment of LLM evaluators and point to future work on other tasks and on rubric-based, faithfulness-aware evaluation frameworks.

Abstract

This paper explores the assumption that Large Language Models (LLMs) skilled in generation tasks are equally adept as evaluators. We assess the performance of three LLMs and one open-source LM in Question-Answering (QA) and evaluation tasks using the TriviaQA (Joshi et al., 2017) dataset. Results indicate a significant disparity, with LLMs exhibiting lower performance in evaluation tasks compared to generation tasks. Intriguingly, we discover instances of unfaithful evaluation where models accurately evaluate answers in areas where they lack competence, underscoring the need to examine the faithfulness and trustworthiness of LLMs as evaluators. This study contributes to the understanding of "the Generative AI Paradox" (West et al., 2023), highlighting a need to explore the correlation between generative excellence and evaluation proficiency, and the necessity to scrutinize the faithfulness aspect in model evaluations.

Paper Structure (28 sections, 2 equations, 4 figures, 3 tables)


Figures (4)

  • Figure 1: Examples of GPT-4's Generative AI paradox in evaluation. Case 1 demonstrates a paradox where the Generation is correct but the Evaluation is incorrect, while Case 2 shows the opposite paradox with the Generation being incorrect but the Evaluation being correct.
  • Figure 2: How Evaluator models rated the Evaluatees' answers on samples that the Evaluator itself SOLVED correctly. Each of the three models listed in the Evaluatee column is an Evaluatee assessed by the Evaluator in the same row. Accuracy values were expected to be 1, but this was not achieved across all Evaluator models.
  • Figure 3: GPT-4 evaluates Vicuna-13b's output that does not directly answer the question, but includes the golden answer, as "Incorrect".
  • Figure 4: GPT-4 evaluates Vicuna-13b's output that does not directly answer the question, but includes the golden answer, as "I don't know".
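
The quantity described in the Figure 2 caption (evaluation accuracy restricted to questions the Evaluator itself solved) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `Record` fields, the function name, the "Correct" label, and the choice to count an "I don't know" verdict as a wrong evaluation are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """One (question, Evaluator, Evaluatee) judgment. All field names are hypothetical."""
    question_id: str
    evaluator_solved: bool   # did the Evaluator answer this question correctly itself?
    evaluatee_correct: bool  # does the Evaluatee's answer match the golden answer?
    verdict: str             # Evaluator's label: "Correct", "Incorrect", or "I don't know"


def accuracy_on_solved(records: List[Record]) -> Optional[float]:
    """Evaluation accuracy over the questions the Evaluator solved on its own.

    Assumption: "I don't know" never matches the ground truth, so it counts as an error.
    Returns None when the Evaluator solved none of the questions.
    """
    solved = [r for r in records if r.evaluator_solved]
    if not solved:
        return None
    hits = 0
    for r in solved:
        judged_right = (
            (r.verdict == "Correct" and r.evaluatee_correct)
            or (r.verdict == "Incorrect" and not r.evaluatee_correct)
        )
        hits += int(judged_right)
    return hits / len(solved)


# Toy data, invented purely for illustration.
toy = [
    Record("q1", True, True, "Correct"),        # faithful judgment
    Record("q2", True, True, "Incorrect"),      # solved it, but mis-evaluated
    Record("q3", True, False, "I don't know"),  # abstained on a solved question
    Record("q4", False, True, "Correct"),       # excluded: Evaluator did not solve it
]
toy_accuracy = accuracy_on_solved(toy)
```

Under these assumptions, a fully consistent Evaluator would score 1.0 on this subset, which is the expectation the Figure 2 caption refers to; any lower value reflects questions the model could answer but could not grade.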