The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models

Noah Y. Siegel, Oana-Maria Camburu, Nicolas Heess, Maria Perez-Ortiz

TL;DR

This paper tackles the problem of faithfulness in free-text explanations produced by large language models. It introduces Correlational Explanatory Faithfulness (CEF) to quantify how well explanations align with the impact of input interventions, and the Correlational Counterfactual Test (CCT), an instantiation of this metric on the Counterfactual Test (CT) that uses total variation distance to measure prediction shifts. By evaluating Llama-2 models on e-SNLI, ECQA, and ComVE with few-shot prompts and counterfactual perturbations, the authors show that CCT captures faithfulness trends that CT misses, and that faithfulness generally improves with model size but varies across datasets. The approach provides a more robust, quantitative framework for assessing the informativeness of explanations, with implications for oversight and safety in high-stakes AI applications. Future work includes expanding intervention types, exploring synonyms, and applying CCT to instruction-tuned models to further enhance evaluation of explanatory faithfulness.

Abstract

In order to oversee advanced AI systems, it is important to understand their underlying decision-making process. When prompted, large language models (LLMs) can provide natural language explanations or reasoning traces that sound plausible and receive high ratings from human annotators. However, it is unclear to what extent these explanations are faithful, i.e., truly capture the factors responsible for the model's predictions. In this work, we introduce Correlational Explanatory Faithfulness (CEF), a metric that can be used in faithfulness tests based on input interventions. Previous metrics used in such tests take into account only binary changes in the predictions. Our metric accounts for the total shift in the model's predicted label distribution, more accurately reflecting the explanations' faithfulness. We then introduce the Correlational Counterfactual Test (CCT) by instantiating CEF on the Counterfactual Test (CT) from Atanasova et al. (2023). We evaluate the faithfulness of free-text explanations generated by few-shot-prompted LLMs from the Llama2 family on three NLP tasks. We find that our metric measures aspects of faithfulness which the CT misses.
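
The idea behind CEF can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the score is a Pearson correlation between the total variation distance (TVD) of the label distribution before and after each intervention and a 0/1 indicator of whether the explanation mentions the inserted text. The function names and the toy numbers are hypothetical.

```python
import math


def total_variation_distance(p, q):
    """TVD between two discrete distributions over the same label set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def pearson(xs, ys):
    """Plain Pearson correlation; raises if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def correlational_faithfulness(probs_before, probs_after, mentioned):
    """Correlate intervention impact (TVD) with explanation mentions.

    Assumption: `mentioned[i]` is True when the explanation generated after
    intervention i mentions the inserted text.
    """
    impacts = [
        total_variation_distance(p, q)
        for p, q in zip(probs_before, probs_after)
    ]
    indicators = [1.0 if m else 0.0 for m in mentioned]
    return pearson(impacts, indicators)


# Hypothetical toy data: three-class label distributions for four interventions.
before = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.8, 0.1, 0.1]]
after = [[0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
mentions = [True, False, True, False]

score = correlational_faithfulness(before, after, mentions)
print(f"correlation between impact and mentions: {score:.3f}")
```

Under this reading, a score near 1 means explanations mention the inserted text mostly when the intervention actually shifts the predicted distribution, while a binary metric would treat a large probability shift that does not flip the top label the same as no change at all.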

Paper Structure (22 sections, 3 equations, 2 figures, 6 tables)

Figures (2)

  • Figure 1: Intervention impactfulness vs. explanation mentions, PE. The plots show the fraction of examples where the explanation mentions the inserted text (IA) vs. the total variation distance (TVD) of the model's predictions before and after interventions. Rows show datasets, columns show models. Higher TVD indicates an intervention was more impactful on the model's prediction. See Figure 2 for results in the EP setting.
  • Figure 2: Intervention impactfulness vs. explanation mentions, EP. The plots show the fraction of examples where the explanation mentions the inserted text (IA) vs. the total variation distance (TVD) of the model's predictions before and after interventions: higher TVD indicates an intervention was more impactful on the model.