Zero-Shot Textual Explanations via Translating Decision-Critical Features

Toshinori Yamauchi, Hiroshi Kera, Kazuhiko Kawamoto

TL;DR

TEXTER produces faithful, classifier-specific textual explanations in a zero-shot setting by isolating decision-critical features. It identifies the neurons that contribute to a prediction via Integrated Gradients, visualizes the concepts they encode, and uses a Sparse Autoencoder to obtain interpretable representations. These representations are aligned with the CLIP vision space so that textual explanations can be retrieved from a concept bank derived from LLMs and VLMs. The method yields more faithful explanations than zero-shot approaches based on global features and works across CNN and Transformer architectures, with more interpretable, semantically aligned explanations. It explains what drives a model's decision in natural language without retraining the original classifier.
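
The neuron-identification step mentioned above can be illustrated with a minimal sketch of Integrated Gradients over a feature vector. Everything here is an assumption made for illustration: the toy two-layer head, the zero baseline, the random weights, and the rule of keeping the four highest-attribution neurons are not taken from the paper.

```python
import numpy as np

def head(h, W1, w2):
    """Toy classifier head (hypothetical): one class logit from features h."""
    return float(w2 @ np.tanh(W1 @ h))

def head_grad(h, W1, w2):
    """Analytic gradient of the logit with respect to each feature (neuron)."""
    z = W1 @ h
    return W1.T @ (w2 * (1.0 - np.tanh(z) ** 2))

def integrated_gradients(h, baseline, W1, w2, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack(
        [head_grad(baseline + a * (h - baseline), W1, w2) for a in alphas]
    )
    # Average gradient along the straight path, scaled by the input difference.
    return (h - baseline) * grads.mean(axis=0)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16)) * 0.3   # assumed toy weights
w2 = rng.normal(size=8)
h = rng.normal(size=16)               # stand-in for penultimate-layer features
baseline = np.zeros(16)               # assumed baseline

attributions = integrated_gradients(h, baseline, W1, w2)
# Assumed selection rule: keep the neurons with the largest attribution.
top_neurons = np.argsort(-attributions)[:4]
```

A useful sanity check is the completeness property: the attributions should sum to roughly `head(h, W1, w2) - head(baseline, W1, w2)`.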

Abstract

Textual explanations make image classifier decisions transparent by describing the prediction rationale in natural language. Large vision-language models can generate captions but are designed for general visual understanding, not classifier-specific reasoning. Existing zero-shot explanation methods align global image features with language, producing descriptions of what is visible rather than what drives the prediction. We propose TEXTER, which overcomes this limitation by isolating decision-critical features before alignment. TEXTER identifies the neurons contributing to the prediction and emphasizes the features encoded in those neurons -- i.e., the decision-critical features. It then maps these emphasized features into the CLIP feature space to retrieve textual explanations that reflect the model's reasoning. A sparse autoencoder further improves interpretability, particularly for Transformer architectures. Extensive experiments show that TEXTER generates more faithful and interpretable explanations than existing methods. The code will be publicly released.
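
To make the contrast between global and decision-critical features concrete, the following toy sketch retrieves a concept from a small bank by cosine similarity, once from the full feature vector and once from features restricted to selected neurons. It is a sketch under strong assumptions: zeroing non-selected neurons stands in for the paper's emphasis step, the linear `aligner` and the one-hot `text_embeddings` replace real CLIP encoders, and all numbers and neuron indices are invented.

```python
import numpy as np

def emphasize(features, neuron_ids):
    """Keep only the selected neurons and zero the rest (a crude stand-in)."""
    mask = np.zeros_like(features)
    mask[neuron_ids] = 1.0
    return features * mask

def retrieve(features, aligner, text_embeddings, concepts, k=1):
    """Map features into the shared space and rank concepts by cosine similarity."""
    z = aligner @ features
    z = z / np.linalg.norm(z)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    scores = t @ z
    order = np.argsort(-scores)[:k]
    return [(concepts[i], float(scores[i])) for i in order]

concepts = ["cushions", "whisker spots", "striped fur"]
# Hypothetical aligner: each pair of neurons maps to one axis of a 3-d shared space.
aligner = np.kron(np.eye(3), np.ones((1, 2)))
# Hypothetical text embeddings: one axis per concept.
text_embeddings = np.eye(3)

# Neurons 0-1 respond strongly to the dominant background, neurons 2-3 weakly
# to the detail that is assumed to drive the prediction.
features = np.array([5.0, 5.0, 1.0, 1.0, 0.2, 0.2])
critical_neurons = [2, 3]  # assumed output of an attribution step

global_top = retrieve(features, aligner, text_embeddings, concepts)
local_top = retrieve(emphasize(features, critical_neurons), aligner,
                     text_embeddings, concepts)
print(global_top[0][0])  # cushions
print(local_top[0][0])   # whisker spots
```

In this toy setup the global features are dominated by the background concept, while restricting to the selected neurons surfaces the concept tied to the prediction, which mirrors the behavior described in the Figure 1 caption.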

Paper Structure

This paper contains 34 sections, 9 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Comparison between Text-To-Concept and the proposed TEXTER for explaining a cat prediction. Text-To-Concept, which relies on global image features, produces the explanation "cushions," describing dominant but irrelevant regions. In contrast, TEXTER isolates decision-critical features, such as whisker spots, through a concept image and translates it into the explanation "whisker spots," faithfully reflecting the model's rationale.
  • Figure 2:
  • Figure 3: Comparison of generated explanations between Text-To-Concept and the proposed method. The figure presents two cases: the prediction of person (top) and the prediction of TV monitor (bottom). For each result generated by TEXTER, the corresponding concept image is displayed.
  • Figure 4: Qualitative comparison of the textual explanations and concept images generated by the proposed method for the same prediction across ResNet-50, ViT, and DINO ViT-S/8. Each row corresponds to one input image and its predicted class.
  • Figure 5: Comparison of the generated explanations between Text-To-Concept and the proposed method for an input whose ground-truth label is water snake but is misclassified as stick insect. For the proposed method, concept images targeting each class are shown. All explanations are generated from a shared concept bank constructed as the union of those for stick insect and water snake.
  • ...and 3 more figures