How good is my story? Towards quantitative metrics for evaluating LLM-generated XAI narratives

Timour Ichmoukhamedov, James Hinns, David Martens

TL;DR

This work introduces an automated framework to evaluate LLM-generated XAI narratives derived from SHAP explanations for tabular binary classification. It couples a narrative-generation step with an extraction-based validation step and downstream metrics: rank, sign, and value agreement for faithfulness, perplexity for assumptions, and embedding-based cosine similarity for comparison with human-written narratives. Through experiments across multiple datasets and LLMs, the authors show that long prompts improve faithfulness, that manipulated narratives degrade metric performance, and that embedding-based similarity can rival or surpass BLEURT in matching human explanations. The study demonstrates scalable, automated validation of XAI narratives and highlights directions for refining the metrics, particularly for assumption plausibility and task-specific embedding-based evaluation. Overall, it provides a practical pipeline for detecting hallucinations and guiding the development of more trustworthy narrative explanations.
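To make the human-similarity metric concrete, the following is a minimal sketch of cosine similarity between two embedding vectors. The vectors, the function name, and the choice of plain Python lists are assumptions for illustration; the summary does not state which embedding model produces the vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings; real ones would come from an embedding model.
generated_narrative = [0.2, 0.7, 0.1]
human_narrative = [0.25, 0.65, 0.05]
score = cosine_similarity(generated_narrative, human_narrative)
```

A score close to 1 means the two embeddings point in nearly the same direction. How well that tracks human judgment depends on the embedding model used.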

Abstract

A rapidly developing application of LLMs in XAI is to convert quantitative explanations such as SHAP into user-friendly narratives to explain the decisions made by smaller prediction models. Evaluating the narratives without relying on human preference studies or surveys is becoming increasingly important in this field. In this work we propose a framework and explore several automated metrics to evaluate LLM-generated narratives for explanations of tabular classification tasks. We apply our approach to compare several state-of-the-art LLMs across different datasets and prompt types. As a demonstration of their utility, these metrics allow us to identify new challenges related to LLM hallucinations for XAI narratives.

Paper Structure

This paper contains 8 sections, 2 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Overview of the workflow presented in the paper. First, the generation LLM produces a narrative based on a SHAP input table. An extraction LLM is then used independently to extract the pieces of information in the narrative that relate to faithfulness and assumptions. Faithfulness to the original input table is checked with downstream metrics such as rank or sign accuracy. For assumptions that the LLM injected into the story and that cannot be checked against the original SHAP table, downstream metrics such as perplexity are explored. Finally, we explore an additional pathway for human similarity by embedding the generated narratives and comparing them to embeddings of human-written narratives with metrics such as cosine similarity.
  • Figure 2: Left: A brief summary of the zero-shot prompt we use to generate the narratives. Right: Excerpt from a narrative generated with gpt-4o for the Fifa Man of the Match dataset. We highlight elements of the narrative that relate to faithfulness in blue and the assumptions injected by the model in red.
  • Figure 3: Left: A brief summary of the zero-shot prompt that we use to extract information from the narratives. Right: Excerpt from an extraction generated with gpt-4o for the narrative from Fig. 2, formatted as a dictionary. The elements highlighted in Fig. 2 appear in the extracted dictionary with the same color coding.
  • Figure 4: Overview tables for SHAP faithfulness, reflecting the differences between the extracted quantities and their ground truth for 20 narratives of the Fifa dataset generated with gpt-4o. Note that the absence of values in the rightmost panel indicates $\phi$ (meaning no value was identified in the narrative).
  • Figure 5: The increase in perplexity after manipulating the assumptions (sorted). For illustration we also provide the assumption together with its manipulation (red) for the most negative and most positive perplexity change.
  • ...and 3 more figures
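
The faithfulness check described in Figures 1, 3, and 4 can be sketched as a comparison between the extracted dictionary and the ground-truth table. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary layout, the field names `rank`, `sign`, and `value`, the feature names, and the function `faithfulness_agreement` are all hypothetical. A missing entry is represented as `None` and excluded from the corresponding score.

```python
import math


def faithfulness_agreement(truth, extracted, rel_tol=1e-6):
    """Fraction of extracted rank/sign/value claims that match the ground truth.

    truth:     {feature: {"rank": int, "sign": int, "value": float}}
    extracted: same layout, with None where the narrative made no claim.
    Returns {"rank": ..., "sign": ..., "value": ...}; a score is None
    when the narrative made no claim of that kind.
    """
    hits = {"rank": [], "sign": [], "value": []}
    for feature, claim in extracted.items():
        reference = truth.get(feature)
        if reference is None:
            continue  # feature absent from the table; ignored in this sketch
        for key in ("rank", "sign"):
            if claim.get(key) is not None:
                hits[key].append(claim[key] == reference[key])
        if claim.get("value") is not None:
            hits["value"].append(
                math.isclose(claim["value"], reference["value"], rel_tol=rel_tol)
            )
    return {k: (sum(v) / len(v) if v else None) for k, v in hits.items()}


# Hypothetical ground truth and extraction for two features.
truth = {
    "Goal Scored": {"rank": 0, "sign": 1, "value": 2.0},
    "Passes": {"rank": 1, "sign": -1, "value": 310.0},
}
extracted = {
    "Goal Scored": {"rank": 0, "sign": 1, "value": 2.0},
    "Passes": {"rank": 2, "sign": -1, "value": None},  # wrong rank, no value stated
}
scores = faithfulness_agreement(truth, extracted)
```

Scoring only the claims the narrative actually makes keeps omissions separate from errors. Whether omissions should also be penalized is a design choice this sketch leaves open.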