Learning by Self-Explaining

Wolfgang Stammer, Felix Friedrich, David Steinmann, Manuel Brack, Hikaru Shindo, Kristian Kersting

TL;DR

This work provides evidence for the potential of self-explaining within the learning phase of an AI model, in terms of model generalization, reducing the influence of confounding factors, and providing more task-relevant and faithful model explanations.

Abstract

Much of explainable AI research treats explanations as a means for model inspection. Yet, this neglects findings from human psychology that describe the benefit of self-explanations in an agent's learning process. Motivated by this, we introduce a novel workflow in the context of image classification, termed Learning by Self-Explaining (LSX). LSX utilizes aspects of self-refining AI and human-guided explanatory machine learning. The underlying idea is that a learner model, in addition to optimizing for the original predictive task, is further optimized based on explanatory feedback from an internal critic model. Intuitively, a learner's explanations are considered "useful" if the internal critic can perform the same task given these explanations. We provide an overview of important components of LSX and, based on this, perform extensive experimental evaluations via three different example instantiations. Our results indicate improvements via Learning by Self-Explaining on several levels: in terms of model generalization, reducing the influence of confounding factors, and providing more task-relevant and faithful model explanations. Overall, our work provides evidence for the potential of self-explaining within the learning phase of an AI model.
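
The idea in the abstract can be sketched as a small training loop. This is a minimal illustration, not the authors' implementation: the function `lsx_train`, its parameters, and the choice to fit once and then repeat the other three steps for a fixed number of rounds are all assumptions. Only the step names (Fit, Explain, Reflect, Revise) come from the paper's Figure 1 caption.

```python
from typing import Any, Callable, List, Tuple


def lsx_train(
    learner: Any,
    critic: Any,
    data: Any,
    fit: Callable[[Any, Any], Any],
    explain: Callable[[Any, Any], Any],
    reflect: Callable[[Any, Any, Any], Any],
    revise: Callable[[Any, Any, Any], Any],
    n_rounds: int = 3,
) -> Tuple[Any, List[Any]]:
    """Hypothetical LSX-style loop: fit once, then explain/reflect/revise."""
    # Fit: optimize the learner for the base task (e.g. image classification).
    learner = fit(learner, data)
    feedback_history: List[Any] = []
    for _ in range(n_rounds):
        # Explain: the learner explains its decisions on the data.
        explanations = explain(learner, data)
        # Reflect: the critic assesses how useful the explanations are
        # for performing the same base task.
        feedback = reflect(critic, explanations, data)
        # Revise: the learner is updated using the critic's feedback.
        learner = revise(learner, data, feedback)
        feedback_history.append(feedback)
    return learner, feedback_history


# Toy usage with stand-in modules (not real models): the "learner" is a
# single number and the feedback is its distance from a target of 1.0.
final, history = lsx_train(
    learner=0.0,
    critic=None,
    data=None,
    fit=lambda l, d: l + 0.5,
    explain=lambda l, d: l,
    reflect=lambda c, e, d: 1.0 - e,
    revise=lambda l, d, f: l + 0.5 * f,
    n_rounds=2,
)
print(final, history)  # prints: 0.875 [0.5, 0.25]
```

Passing the four steps as callables keeps the loop independent of any particular instantiation: a CNN, a neuro-symbolic learner, or a vision-language model would each supply its own versions.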

Paper Structure (20 sections, 11 equations, 11 figures, 9 tables, 1 algorithm)

Figures (11)

  • Figure 1: (left) Current (self-)refinement machine learning utilizes (I.) forms of self-supervised model refinement (e.g., self-rewarding). On the other hand, it also relies on (II.) explanatory interactive learning (XIL), which utilizes human feedback via explanations for model refinement. (right) In contrast, we introduce Learning by Self-Explaining, which integrates ideas from both research fields into one approach as explanation-based self-refinement (I. + II.). A model in LSX consists of two submodels, a learner and an (internal) critic, and performs refinement via four modules ($\textsc{Fit}$, $\textsc{Explain}$, $\textsc{Reflect}$, $\textsc{Revise}$, cf. Alg. 1). The learner is optimized for a base task in $\textsc{Fit}$ (e.g., image classification), after which it provides explanations for its decisions in $\textsc{Explain}$. In the $\textsc{Reflect}$ module, the critic assesses how useful the explanations are for performing the base task. The resulting feedback from the critic is used to $\textsc{Revise}$ the learner.
  • Figure 2: Example explanations on MNIST from the CNN baseline vs. CNN-LSX. Four random explanations are shown per image class (class ids on sides).
  • Figure 3: CNN-LSX: Learning by Self-Explaining instantiation for training CNNs for supervised image classification. Here CNNs represent both the learner and critic. Explanations are generated via InputXGradient. The feedback represents the classification loss of the critic on these explanations.
  • Figure 4: NeSy-LSX: Learning by Self-Explaining instantiation for supervised image classification via a neuro-symbolic concept learner. The learner proposes a set of candidate class-specific logical explanations. The critic represents a neuro-symbolic forward reasoner, which computes the validity of these logical statements given visual input. The feedback represents a probabilistic ranking of the set of logical explanations. With this ranking, we identify the most likely explanation per image class and revise the learner to use only this explanation for samples of that class.
  • Figure 5: VLM-LSX: Learning by Self-Explaining instantiation for visual question answering via a vision-language model. The learner proposes a set of candidate explanations via explanation prompting. The critic represents a pre-trained language model, which provides preference scores over the generated explanations. The feedback represents a binary ranking of the set of explanations with which we identify "good" explanations and revise the learner for predicting these explanations.
  • ...and 6 more figures
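
The Figure 3 caption names InputXGradient as the explanation method in CNN-LSX. As a rough illustration of that attribution rule (each input value multiplied by the gradient of the model output with respect to it), the sketch below estimates the gradient with central finite differences so that it needs no autodiff library. The function `input_x_gradient`, the linear `score` function, and its weights are hypothetical and chosen only for illustration; a real CNN would obtain the gradient by backpropagation.

```python
from typing import Callable, List, Sequence


def input_x_gradient(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    eps: float = 1e-5,
) -> List[float]:
    """Attribute f(x) to each input i as x_i * df/dx_i.

    The gradient is a central finite-difference estimate (an assumption
    of this sketch, standing in for backpropagation).
    """
    attributions = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad_i = (f(up) - f(down)) / (2 * eps)
        attributions.append(x[i] * grad_i)
    return attributions


# Hypothetical linear "class score"; the weights are illustrative only.
weights = [2.0, -1.0, 0.0]


def score(x: Sequence[float]) -> float:
    return sum(w * xi for w, xi in zip(weights, x))


attr = input_x_gradient(score, [1.0, 3.0, 5.0])
# For a linear score the attribution is approximately x_i * w_i,
# i.e. close to [2.0, -3.0, 0.0] here.
print(attr)
```

The third input has a large value but zero weight, so it receives (approximately) zero attribution: the method highlights inputs that are both present and influential for the score.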