Digital Socrates: Evaluating LLMs through Explanation Critiques

Yuling Gu, Oyvind Tafjord, Peter Clark

TL;DR

This work evaluates LLM explanations beyond raw answer accuracy by introducing explanation critiquing, a structured framework that localizes flaws, categorizes them, and provides actionable guidance. It presents Digital Socrates (DS-7B and DS-13B), open-source critique models trained on the DS Critique Bank, a large human-verified dataset of explanation critiques spanning science and commonsense domains. The dataset enables reference-free assessment of reasoning quality via a five-tuple critique $(f_{loc}, f_{dim}, s_{gen}, s_{spec}, E_{SC})$ and supports evaluation without costly API calls. Empirical results show that GPT-4 yields high-quality critiques closely aligned with human judgments, while the smaller DS models achieve competitive performance and notable agreement with humans, demonstrating scalable, interpretable evaluation of model explanations. Overall, the work provides a new, practical toolset for diagnosing and improving explanation behavior in QA and reasoning tasks.
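As a rough illustration of the five-tuple critique, the sketch below stores one critique per record and averages the explanation score separately for correctly and incorrectly answered questions. The class name, the helper function, the reading of each field (inferred from the field names and the description above), and all sample values are assumptions made for illustration; they are not taken from the paper's code or data.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class ExplanationCritique:
    """Hypothetical container for one critique five-tuple."""
    f_loc: Optional[str]   # where the main flaw occurs; None if no flaw (assumed)
    f_dim: Optional[str]   # flaw category, e.g. "incorrect information"
    s_gen: Optional[str]   # general suggestion for addressing the flaw
    s_spec: Optional[str]  # specific, fine-grained suggestion
    e_sc: float            # explanation score E_SC

    @property
    def has_flaw(self) -> bool:
        # Assumption: a critique with no flaw category means no main flaw.
        return self.f_dim is not None


def mean_score_by_accuracy(records):
    """Average E_SC per answer-correctness bucket.

    `records` is an iterable of (answer_correct, critique) pairs.
    """
    buckets = defaultdict(list)
    for answer_correct, critique in records:
        buckets[bool(answer_correct)].append(critique.e_sc)
    return {correct: mean(scores) for correct, scores in buckets.items()}


# Made-up sample records, for illustration only.
records = [
    (True, ExplanationCritique(None, None, None, None, 5)),
    (True, ExplanationCritique("step 2", "incorrect information",
                               "check facts", "verify the claim in step 2", 2)),
    (False, ExplanationCritique("final answer", "inconsistent answer",
                                "match answer to reasoning",
                                "pick the option the reasoning supports", 1)),
]
print(mean_score_by_accuracy(records))
```

Keeping the score separate from answer correctness is what lets an analysis of this kind show that a correct answer can still rest on a flawed explanation.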

Abstract

While LLMs can provide reasoned explanations along with their answers, the nature and quality of those explanations are still poorly understood. In response, our goal is to define a detailed way of characterizing the explanation capabilities of modern models and to create a nuanced, interpretable explanation evaluation tool that can generate such characterizations automatically, without relying on expensive API calls or human annotations. Our approach is to (a) define the new task of explanation critiquing: identifying and categorizing any main flaw in an explanation and providing suggestions to address the flaw, (b) create a sizeable, human-verified dataset for this task, and (c) train an open-source, automatic critique model (called Digital Socrates) using this data. Through quantitative and qualitative analysis, we demonstrate how Digital Socrates is useful for revealing insights about student models by examining their reasoning chains, and how it can provide high-quality, nuanced, automatic evaluation of those model explanations for the first time. Digital Socrates thus fills an important gap in evaluation tools for understanding and improving the explanation behavior of models.

Paper Structure (38 sections, 1 equation, 15 figures, 12 tables)

This paper contains 38 sections, 1 equation, 15 figures, 12 tables.

Figures (15)

  • Figure 1: Given a multiple-choice question (together with the answer options and correct answer), as well as a model-generated reasoning chain and answer, our system Digital Socrates gives a critique of the model-generated explanation. In its critiques, Digital Socrates provides localized feedback on where and why reasoning chains are flawed (focusing on the main flaw, if any), accompanied by general and fine-grained suggestions to address the identified flaw, providing nuance and interpretability to the critiques.
  • Figure 2: In student models, (human-annotated) explanation scores $E_{SC}$ vary greatly within cases where models get the answer right (accuracy = 1) or wrong (accuracy = 0). Even when a model gets the answer correct, its reasoning chain can contain varying degrees of flaws. On the other hand, when a model is incorrect in its answer, it could still make some valid points.
  • Figure 3: The pie charts show distributions of explanation flaws across all student models. Even when models get the answer correct, they may still make errors in their reasoning chain (left). When models answer incorrectly, explanation critiquing helps in categorizing and diagnosing errors in the reasoning chain (right).
  • Figure 4: GPT-3.5 and Llama-2-70B student models achieve comparable $Acc$ on Science datasets, with the latter having slightly lower $E_{SC}$. They also show different patterns in their explanation flaws, e.g., in the amount of incorrect information vs. inconsistent answer.
  • Figure 5: Sample explanation critique from DS-13B.
  • ...and 10 more figures