How good is my story? Towards quantitative metrics for evaluating LLM-generated XAI narratives
Timour Ichmoukhamedov, James Hinns, David Martens
TL;DR
This work introduces an automated framework for evaluating LLM-generated XAI narratives derived from SHAP explanations for tabular binary classification. It couples a narrative-generation step with an extraction-based validation loop and downstream metrics: Rank/Sign/Value agreement for faithfulness, perplexity for the plausibility of assumptions, and embedding-based cosine similarity for closeness to human-written narratives. Through experiments across multiple datasets and LLMs, the authors show that long prompts improve faithfulness, that manipulated narratives score worse on these metrics, and that embedding-based similarity can rival or surpass BLEURT in matching human explanations. The study demonstrates scalable, automated validation of XAI narratives and highlights directions for refining the metrics, particularly for assumption plausibility and task-specific embedding-based evaluation. Overall, it provides a practical pipeline for detecting hallucinations and guiding the development of more trustworthy narrative explanations.
Abstract
A rapidly developing application of LLMs in XAI is to convert quantitative explanations such as SHAP into user-friendly narratives to explain the decisions made by smaller prediction models. Evaluating the narratives without relying on human preference studies or surveys is becoming increasingly important in this field. In this work we propose a framework and explore several automated metrics to evaluate LLM-generated narratives for explanations of tabular classification tasks. We apply our approach to compare several state-of-the-art LLMs across different datasets and prompt types. As a demonstration of their utility, these metrics allow us to identify new challenges related to LLM hallucinations for XAI narratives.
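To make the Rank/Sign/Value agreement idea from the TL;DR concrete, the following is a minimal sketch, not the authors' implementation. It assumes that an extraction step has already turned a narrative into per-feature claims, and it defines each agreement score as the fraction of mentioned features whose claimed rank, sign, or value matches the reference SHAP explanation. The `FeatureClaim` schema, the `agreement` function, and the toy feature names and numbers are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureClaim:
    """What is stated about one feature (hypothetical schema)."""
    rank: int      # 0 = most important by absolute SHAP value
    sign: int      # +1 pushes toward the positive class, -1 away from it
    value: float   # the feature's value for the explained instance


def agreement(extracted, reference, rel_tol=1e-6):
    """Fraction of mentioned features whose rank, sign, and value match.

    Assumed definition for illustration; the paper's exact formulas may differ.
    """
    shared = [name for name in extracted if name in reference]
    if not shared:
        raise ValueError("no overlapping features to compare")
    hits = {"rank": 0, "sign": 0, "value": 0}
    for name in shared:
        said, truth = extracted[name], reference[name]
        hits["rank"] += said.rank == truth.rank
        hits["sign"] += said.sign == truth.sign
        hits["value"] += math.isclose(said.value, truth.value, rel_tol=rel_tol)
    return {key: count / len(shared) for key, count in hits.items()}


# Toy, made-up data: the reference comes from SHAP, the extracted claims from a narrative.
reference = {
    "age": FeatureClaim(rank=0, sign=+1, value=52.0),
    "income": FeatureClaim(rank=1, sign=-1, value=31000.0),
}
extracted = {
    "age": FeatureClaim(rank=0, sign=+1, value=52.0),
    "income": FeatureClaim(rank=1, sign=+1, value=31000.0),  # sign misreported
}
scores = agreement(extracted, reference)
print(scores)
```

In this toy case the narrative misreports the direction of one of two features, so the sign score drops while the rank and value scores stay perfect. A score below 1.0 on any of the three would flag the narrative as potentially unfaithful to the underlying explanation.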
