Evaluating Readability and Faithfulness of Concept-based Explanations

Meng Li, Haoran Jin, Ruixuan Huang, Zhihao Xu, Defu Lian, Zijia Lin, Di Zhang, Xiting Wang

TL;DR

A formal definition of concepts is introduced that generalizes to diverse concept-based explanation settings. The faithfulness of a concept explanation is quantified via perturbation, with an optimization problem ensuring adequate perturbation in the high-dimensional space for different concepts.

Abstract

With the growing popularity of general-purpose Large Language Models (LLMs) comes a need for more global explanations of model behaviors. Concept-based explanations arise as a promising avenue for explaining high-level patterns learned by LLMs. Yet their evaluation poses unique challenges, especially due to their non-local nature and high-dimensional representation in a model's hidden space. Current methods approach concepts from different perspectives and lack a unified formalization. This makes it challenging to evaluate the core measures of concepts, namely faithfulness and readability. To bridge the gap, we introduce a formal definition of concepts that generalizes to diverse concept-based explanation settings. Based on this, we quantify the faithfulness of a concept explanation via perturbation. We ensure adequate perturbation in the high-dimensional space for different concepts via an optimization problem. Readability is approximated via an automatic and deterministic measure that quantifies the coherence of patterns that maximally activate a concept while aligning with human understanding. Finally, based on measurement theory, we apply a meta-evaluation method for evaluating these measures, which is generalizable to other types of explanations or tasks as well. Extensive experimental analysis has been conducted to inform the selection of explanation evaluation measures.
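
To make the perturbation idea concrete, the following is a minimal sketch and not the paper's method. It assumes a concept can be represented as a single direction in hidden space, replaces the paper's optimization-based perturbation with a plain projection ablation, and uses a toy linear readout with random data. All names (`ablate_concept`, `faithfulness_score`, `readout`) are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ablate_concept(hidden, direction):
    """Remove the component of each hidden vector along `direction`.

    Assumption: a simple projection ablation, standing in for the
    optimized perturbation described in the abstract.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden - np.outer(hidden @ unit, unit)

def faithfulness_score(hidden, direction, readout):
    """Mean total-variation distance between outputs before and after ablation."""
    p_before = softmax(hidden @ readout)
    p_after = softmax(ablate_concept(hidden, direction) @ readout)
    return 0.5 * np.abs(p_before - p_after).sum(axis=-1).mean()

rng = np.random.default_rng(0)
hidden = rng.normal(size=(32, 8))   # 32 toy hidden states of dimension 8
readout = rng.normal(size=(8, 4))   # toy linear output head with 4 classes

# A direction the head actually reads from: the weights of class 0.
used = readout[:, 0]

# A direction orthogonal to every readout column, built via QR.
q, _ = np.linalg.qr(np.hstack([readout, rng.normal(size=(8, 1))]))
unused = q[:, 4]

score_used = faithfulness_score(hidden, used, readout)
score_unused = faithfulness_score(hidden, unused, readout)
print(f"used direction:   {score_used:.4f}")
print(f"unused direction: {score_unused:.4f}")
```

In this toy setup, ablating a direction the output head depends on changes the output distribution, while ablating a direction orthogonal to the head leaves it essentially unchanged, which is the intuition behind a perturbation-based faithfulness score.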

Paper Structure

This paper contains 22 sections, 18 equations, 9 figures, and 6 tables.

Figures (9)

  • Figure 1: The overall framework. (a) Concept extraction: We formalize concepts as virtual neurons. (b) Evaluation is approached via readability and faithfulness. Readability is approximated by the semantic similarity of patterns that maximally activate the concept. Faithfulness is approximated by the difference in output when a concept is perturbed. (c) Meta-Evaluation is performed on the observed results of proposed measures via reliability and validity.
  • Figure 2: Estimated test-retest reliability and subset consistency of the proposed measures. The red dashed line indicates the minimal standard of 0.9 (Nunnally, 1994).
  • Figure 3: The MTMM table of the evaluation measures: 1) subset consistency is shown on the diagonals; 2) construct validity is displayed on the off-diagonals.
  • Figure 4: Performance of different baselines on representative measures.
  • Figure 5: Taxonomy of prior automatic metrics on concept-based explanation methods.
  • ...and 4 more figures
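
The Figure 1 caption describes readability as the semantic similarity of patterns that maximally activate a concept. A minimal sketch of that idea follows, assuming each pattern already has an embedding vector and using mean pairwise cosine similarity as the coherence score. The embeddings, activations, and function names here are illustrative toy values, not taken from the paper.

```python
import numpy as np

def top_k_patterns(activations, k):
    """Indices of the k patterns with the highest concept activation."""
    return np.argsort(activations)[::-1][:k]

def coherence(embeddings):
    """Mean cosine similarity over all distinct pairs of embeddings."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(unit)
    # Subtract the diagonal (self-similarity) before averaging.
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))

# Toy 2-D embeddings for six patterns (hypothetical values).
embeddings = np.array([
    [1.00, 0.00],
    [0.90, 0.10],
    [0.95, 0.05],
    [0.00, 1.00],
    [-1.00, 0.00],
    [0.00, -1.00],
])

# Concept A fires most on three similar patterns; concept B on dissimilar ones.
activations_a = np.array([0.9, 0.8, 0.7, 0.1, 0.0, 0.05])
activations_b = np.array([0.9, 0.0, 0.1, 0.8, 0.7, 0.05])

readability_a = coherence(embeddings[top_k_patterns(activations_a, 3)])
readability_b = coherence(embeddings[top_k_patterns(activations_b, 3)])
print(f"concept A: {readability_a:.3f}")
print(f"concept B: {readability_b:.3f}")
```

A concept whose top-activating patterns point in similar directions scores close to 1, while one whose top patterns are unrelated scores much lower. The choice of embedding model and of k is left open here.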