LExT: Towards Evaluating Trustworthiness of Natural Language Explanations

Krithi Shailya; Shreya Rajpal; Gokul S Krishnan; Balaraman Ravindran

TL;DR

The paper introduces Language Explanation Trustworthiness (LExT), a framework that jointly assesses Plausibility and Faithfulness to evaluate the trustworthiness of LLM-generated explanations in high-stakes domains like healthcare. It defines a suite of metrics, including Correctness, Consistency, QAG Score, Counterfactual Stability, and Contextual Faithfulness, and derives the final LExT score as the harmonic mean of Plausibility and Faithfulness. Applied to QPain and PubMedQA with six models (domain-specific and general-purpose), LExT reveals meaningful differences in explanation quality and robustness, highlighting trade-offs between plausibility and faithfulness across models. The framework offers a principled, generalizable approach to improve transparency and reliability of medical AI explanations, with potential applicability to other domains such as law and finance.
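To make the final aggregation step concrete, here is a minimal sketch of a harmonic mean over two scores. It assumes Plausibility and Faithfulness have already been computed and each lies in [0, 1]. The function name `lext_score` and the range check are illustrative choices rather than part of the authors' released code, and how the sub-metrics roll up into the two inputs is not covered here.

```python
def lext_score(plausibility: float, faithfulness: float) -> float:
    """Harmonic mean of two scores (assumed here to lie in [0, 1])."""
    for name, value in (("plausibility", plausibility),
                        ("faithfulness", faithfulness)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total = plausibility + faithfulness
    if total == 0.0:
        # Both scores are zero; return 0 instead of dividing by zero.
        return 0.0
    return 2.0 * plausibility * faithfulness / total


if __name__ == "__main__":
    # A plausible but weakly faithful explanation is pulled toward the lower score.
    print(lext_score(0.9, 0.3))
```

Because the harmonic mean is dominated by the smaller input, a model cannot reach a high combined score by excelling on only one of the two dimensions.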

Abstract

As Large Language Models (LLMs) become increasingly integrated into high-stakes domains, several approaches have been proposed for generating natural language explanations. These explanations are crucial for enhancing the interpretability of a model, especially in sensitive domains like healthcare, where transparency and reliability are key. Because these explanations are themselves generated by LLMs, with their known concerns, there is a growing need for robust evaluation frameworks to assess them. Natural Language Generation metrics like BLEU and ROUGE capture syntactic and semantic accuracies but overlook other crucial aspects such as factual accuracy, consistency, and faithfulness. To address this gap, we propose a general framework for quantifying the trustworthiness of natural language explanations, balancing Plausibility and Faithfulness, to derive a comprehensive Language Explanation Trustworthiness Score (LExT). The code and setup to reproduce our experiments are publicly available at https://github.com/cerai-iitm/LExT. Applying our domain-agnostic framework to the healthcare domain using public medical datasets, we evaluate six models, including domain-specific and general-purpose models. Our findings demonstrate significant differences in their ability to generate trustworthy explanations. Comparing these explanations, we observe, for example, that general-purpose models show inconsistencies in Faithfulness yet tend to outperform domain-specific fine-tuned models. This work further highlights the importance of using a tailored evaluation framework to assess natural language explanations in sensitive fields, providing a foundation for improving the trustworthiness and transparency of language models in healthcare and beyond.
