Local Explanations and Self-Explanations for Assessing Faithfulness in black-box LLMs

Christos Fragkathoulas, Odysseas S. Chlapanis

TL;DR

The paper tackles the challenge of evaluating the faithfulness of explanations from black-box LLMs by proposing a local, perturbation-based method, inspired by leave-one-out, that identifies sufficient context regions and necessary keywords. A structured prompting scheme elicits self-explanations, which are then quantitatively compared to the identified informative parts using a hybrid QA metric and a formal faithfulness score. Experiments on Natural Questions with Retrieval-Hard subsets indicate that the approach is feasible and cost-efficient, and that newer models (e.g., GPT-4o) yield higher faithfulness than earlier ones (e.g., GPT-3.5). The work contributes a practical framework for measuring faithfulness in proprietary LLMs and provides a foundation for more faithful AI explanations in real-world QA tasks.

Abstract

This paper introduces a novel task to assess the faithfulness of large language models (LLMs) using local perturbations and self-explanations. LLMs often require additional context to answer certain questions correctly. To identify which parts of that context matter, we propose a new, efficient explainability technique inspired by the commonly used leave-one-out approach. With this technique, we identify the parts of the context that are sufficient and necessary for the LLM to generate correct answers; these parts serve as explanations. We propose a metric for assessing faithfulness that compares these crucial parts with the self-explanations of the model. We validate our approach on the Natural Questions dataset, demonstrating its effectiveness in explaining model decisions and assessing faithfulness.
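
As a rough illustration of the leave-one-out idea the abstract refers to, the sketch below removes one context segment at a time and records which removals flip a correct answer to an incorrect one. It is a generic leave-one-out loop, not the paper's own algorithm: the names `leave_one_out_necessary`, `answer_fn`, `is_correct`, and `toy_answer_fn` are hypothetical, and the toy function merely stands in for a call to a black-box LLM.

```python
from typing import Callable, List


def leave_one_out_necessary(
    question: str,
    segments: List[str],
    answer_fn: Callable[[str, str], str],
    is_correct: Callable[[str], bool],
) -> List[int]:
    """Return indices of segments whose removal makes the answer incorrect."""
    necessary = []
    for i in range(len(segments)):
        # Rebuild the context with segment i left out.
        reduced = " ".join(s for j, s in enumerate(segments) if j != i)
        if not is_correct(answer_fn(question, reduced)):
            necessary.append(i)
    return necessary


def toy_answer_fn(question: str, context: str) -> str:
    # Hypothetical stand-in for a black-box LLM: it only "knows" the answer
    # when the supporting sentence is present in the context.
    if "capital of France is Paris" in context:
        return "Paris"
    return "unknown"


segments = [
    "The Eiffel Tower is tall.",
    "The capital of France is Paris.",
    "France is in Europe.",
]
necessary = leave_one_out_necessary(
    "What is the capital of France?",
    segments,
    toy_answer_fn,
    lambda answer: answer == "Paris",
)
```

In this toy setup only the second segment is flagged, since removing either of the other two leaves the answer unchanged. A real run would replace `toy_answer_fn` with a model query and `is_correct` with a QA metric, and would cost one model call per segment.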

Paper Structure (15 sections, 7 equations, 1 table, 2 algorithms)