TExplain: Explaining Learned Visual Features via Pre-trained (Frozen) Language Models

Saeid Asgari Taghanaki; Aliasghar Khani; Ali Saheb Pasand; Amir Khasahmadi; Aditya Sanghi; Karl D. D. Willis; Ali Mahdavi-Amiri

TL;DR

TExplain interprets the learned features of independently trained image classifiers by connecting a frozen image encoder to a frozen pre-trained language model through a trainable translator network. The translator maps the classifier’s feature vector $Z$ to a language-model input $Z_{mapped}$; the language model then generates many descriptive sentences via nucleus sampling, and the most frequent words across those sentences form an interpretable word cloud. Faithfulness tests, including a human analysis, show that the approach reveals spurious correlations and dataset biases (e.g., Background Challenge cues, Waterbirds shifts), and a mitigation workflow based on GradCAM masking improves worst-group accuracy while preserving overall performance. The work also checks the explanations against the latent space of Stable Diffusion and against BLIP captions, supporting the claim that the extracted words reflect meaningful visual features. Overall, TExplain offers an adaptable framework for diagnosing and mitigating biases in image classifiers, with potential extensions to segmentation, autoencoders, and 3D data modalities.

Abstract

Interpreting the learned features of vision models has posed a longstanding challenge in the field of machine learning. To address this issue, we propose a novel method that leverages the capabilities of language models to interpret the learned features of pre-trained image classifiers. Our method, called TExplain, tackles this task by training a neural network to establish a connection between the feature space of image classifiers and language models. Then, during inference, our approach generates a vast number of sentences to explain the features learned by the classifier for a given image. These sentences are then used to extract the most frequent words, providing a comprehensive understanding of the learned features and patterns within the classifier. Our method, for the first time, utilizes these frequent words corresponding to a visual representation to provide insights into the decision-making process of the independently trained classifier, enabling the detection of spurious correlations, biases, and a deeper comprehension of its behavior. To validate the effectiveness of our approach, we conduct experiments on diverse datasets, including ImageNet-9L and Waterbirds. The results demonstrate the potential of our method to enhance the interpretability and robustness of image classifiers.
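
The final step described above, turning many generated sentences into a ranked list of frequent words, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dominant_words`, the stop-word list, the tokenization rule, and the example sentences are all assumptions made for illustration, and the sentences merely stand in for language-model samples conditioned on one feature vector.

```python
import re
from collections import Counter

# Hypothetical stop-word list (an assumption; the paper's filtering may differ).
STOP_WORDS = {"a", "an", "the", "of", "on", "in", "and", "with", "is", "at", "to"}


def dominant_words(sentences, top_k=5):
    """Return the top_k most frequent non-stop-words with relative frequencies."""
    counts = Counter()
    for sentence in sentences:
        # Lowercase and keep alphabetic tokens only (a simplifying assumption).
        tokens = re.findall(r"[a-z]+", sentence.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    total = sum(counts.values())
    if total == 0:
        return []
    # The relative frequency could drive the font size in a word cloud.
    return [(word, n / total) for word, n in counts.most_common(top_k)]


# Stand-in for sentences sampled for a single visual feature vector.
sampled = [
    "a bird standing on a branch",
    "a small bird in the water",
    "a bird near water and reeds",
]
top = dominant_words(sampled, top_k=3)
```

In this toy input, a word such as "water" ranking high next to "bird" is the kind of signal that could hint at a background shortcut, which is how the word clouds are meant to be read.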

Paper Structure (15 sections, 7 figures, 2 tables)

Introduction
Method
Pre-trained Image Classifier
Training the Translator Network
Identifying Dominant Words by Sampling
Implementation details
Experiments
Faithfulness Test 1: Verifying that TExplain Picks up Relevant Features
Faithfulness Test 2: Human Analysis
Faithfulness Test 3: Verifying using Generative Model's Latent Space
Detecting Potential Shortcuts/Spurious Correlations
Related Work
Visual Heat Map-based Explanations.
Textual Explanation of Vision Models.
Conclusion

Figures (7)

Figure 1: TExplain projects learned visual representations of a frozen image classifier onto a space that an independently trained language model can interpret. Using a large number of generated sentence samples along with the visual representation, TExplain produces a word cloud for each visual representation. Blue and green refer to frozen and trainable parameters, respectively. The category of the feature representation is highlighted in red, while other captured features are shown in gray. The font size of each word indicates the strength of its corresponding feature.
Figure 2: Class predictions and corresponding word clouds generated by TExplain for both the original (top) and Only-BG-T (bottom) samples extracted from the Background Challenge dataset. The class predictions are displayed below each image.
Figure 3: Human analysis experiments were conducted to validate that TExplain generates meaningful explanations. The left side shows the count of relevant words, while the right side illustrates whether the most frequent word produced by TExplain is relevant to the classifier's top prediction.
Figure 4: Word clouds based on image captioning with BLIP (top) and feature captioning with TExplain (bottom) for the categories kitchen, bathroom, and bus.
Figure 5: Word clouds generated from TExplain explanations for the Background Challenge Dataset categories. Red represents the detected features related to the main category.
...and 2 more figures
