Explingo: Explaining AI Predictions using Large Language Models

Alexandra Zytek, Sara Pido, Sarah Alnegheimish, Laure Berti-Equille, Kalyan Veeramachaneni

TL;DR

The paper addresses the challenge of turning explainable AI outputs into human-readable narratives by introducing Explingo, a two-component system with a Narrator that converts SHAP explanations into narratives and a Grader that automatically evaluates narrative quality across accuracy, completeness, fluency, and conciseness. Results show that LLMs can generate high-quality narratives when guided by a small set of hand-written and bootstrapped exemplars, with the Grader providing automated, scalable evaluation via a weighted score $G = \alpha_a A + \alpha_f F + \alpha_c C + \alpha_s S$. The work provides an open-source implementation within Pyreal and nine exemplar datasets to support tuning and evaluation, highlighting the trade-offs between exemplar quantity and narrative fidelity. This approach enables safer, more usable narrative explanations and lays the groundwork for interactive, natural-language ML explanations in real-world decision-making.
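The weighted score above is a plain linear combination, so it can be sketched in a few lines. This is a minimal illustration, not the Pyreal implementation: the function name `grader_score`, the dictionary layout, and all numeric values are assumptions, and the symbols A, F, C, and S are used exactly as they appear in the formula because this summary does not spell out which metric each one denotes.

```python
from typing import Mapping

# Symbols taken verbatim from G = alpha_a*A + alpha_f*F + alpha_c*C + alpha_s*S.
METRIC_KEYS = ("A", "F", "C", "S")


def grader_score(metrics: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return the weighted sum of the four metric scores (hypothetical helper)."""
    missing = [k for k in METRIC_KEYS if k not in metrics or k not in weights]
    if missing:
        raise KeyError(f"missing metric or weight for: {missing}")
    return sum(weights[k] * metrics[k] for k in METRIC_KEYS)


# Hypothetical metric scores and equal weights, chosen only for illustration.
example_metrics = {"A": 1.0, "F": 0.75, "C": 1.0, "S": 0.5}
example_weights = {"A": 0.25, "F": 0.25, "C": 0.25, "S": 0.25}

print(grader_score(example_metrics, example_weights))  # 0.8125
```

Raising the weight on one metric relative to the others makes the total score more sensitive to that metric, which is one way to express a preference when metrics conflict.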

Abstract

Explanations of machine learning (ML) model predictions generated by Explainable AI (XAI) techniques such as SHAP are essential for people using ML outputs for decision-making. We explore the potential of Large Language Models (LLMs) to transform these explanations into human-readable, narrative formats that align with natural communication. We address two key research questions: (1) Can LLMs reliably transform traditional explanations into high-quality narratives? and (2) How can we effectively evaluate the quality of narrative explanations? To answer these questions, we introduce Explingo, which consists of two LLM-based subsystems, a Narrator and Grader. The Narrator takes in ML explanations and transforms them into natural-language descriptions. The Grader scores these narratives on a set of metrics including accuracy, completeness, fluency, and conciseness. Our experiments demonstrate that LLMs can generate high-quality narratives that achieve high scores across all metrics, particularly when guided by a small number of human-labeled and bootstrapped examples. We also identified areas that remain challenging, in particular for effectively scoring narratives in complex domains. The findings from this work have been integrated into an open-source tool that makes narrative explanations available for further applications.

Paper Structure

This paper contains 19 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Sample Narrator inputs and outputs. The items in blue make up the prompt passed to the Narrator to transform ML explanations into narratives for a house-pricing example. The item in green is a narrative the Narrator LLM may generate based on this prompt. Some components, like the output instructions, are provided by DSPy (Khattab et al., 2023).
  • Figure 2: Prompt passed to the Grader to compute the accuracy metric. Experiments suggested that asking for a grade from a 2-point rubric resulted in better performance than a simple yes/no question. We saw that the Grader incorrectly considered narratives that were accurate but did not include all feature values from the input explanation as inaccurate, and regularly did not notice when contribution directions were wrong if the values were correct. We added explicit instructions to the prompt for these two scenarios accordingly. Finally, we determined that using a chain-of-thought prompting approach further improved the Grader's ability to correctly score accuracy.
  • Figure 3: Prompt passed to the Grader to compute the completeness metric. We found that while the original version of the prompt allowed the LLM to correctly identify missing feature values, it was regularly not identifying when features were missing altogether or did not have their feature directions listed. Through experimentation, we determined that a chain-of-thought (Wei et al., 2023) approach that explicitly asked the Grader to list out and consider each feature in the explanation one-by-one addressed this issue well.
  • Figure 4: Prompt passed to the Grader to compute the fluency metric. We found that the Grader was over-emphasizing narrative content, as shown by a significant difference in score between narratives from different datasets with the original prompt ("How well does the narrative match the style of the example narratives?"). We therefore added a statement to explicitly ignore topic. Our experiments determined that there were diminishing returns to effectiveness after 5 exemplar narratives.
  • Figure 5: Difference between fluency scores on narratives compared to exemplars from their own datasets (blue) and those compared against other datasets (orange). We expect to see significantly higher fluency scores for the former compared to the latter. We see that adding more exemplars increases this difference, with diminishing returns after 5 exemplars.
  • ...and 2 more figures
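
The own-dataset versus other-dataset comparison described for Figure 5 comes down to a difference of mean fluency scores at each exemplar count. The sketch below shows that arithmetic under stated assumptions: the helper `fluency_gap`, the data layout, and every score are hypothetical and are not taken from the paper's results.

```python
from statistics import mean


def fluency_gap(own_scores, other_scores):
    """Mean fluency against own-dataset exemplars minus mean against other-dataset exemplars."""
    if not own_scores or not other_scores:
        raise ValueError("both score lists must be non-empty")
    return mean(own_scores) - mean(other_scores)


# Hypothetical fluency scores, keyed by the number of exemplars shown to the Grader.
scores_by_exemplar_count = {
    1: {"own": [3.0, 3.0, 2.0, 4.0], "other": [2.0, 3.0, 2.0, 3.0]},
    5: {"own": [4.0, 4.0, 3.0, 4.0], "other": [1.0, 2.0, 1.0, 2.0]},
}

# A larger gap means the Grader separates matching style from non-matching style more clearly.
gaps = {
    n: fluency_gap(s["own"], s["other"])
    for n, s in scores_by_exemplar_count.items()
}

for n, gap in sorted(gaps.items()):
    print(f"{n} exemplar(s): gap = {gap}")
```

Plotting this gap against the exemplar count is one way to look for the point of diminishing returns that the caption reports.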