TExplain: Explaining Learned Visual Features via Pre-trained (Frozen) Language Models
Saeid Asgari Taghanaki, Aliasghar Khani, Ali Saheb Pasand, Amir Khasahmadi, Aditya Sanghi, Karl D. D. Willis, Ali Mahdavi-Amiri
TL;DR
TExplain interprets the learned features of independently trained image classifiers by connecting a frozen image encoder to a pre-trained language model through a trainable translator. The translator maps the classifier’s feature vector $Z$ to a language-model input $Z_{mapped}$, from which the decoder generates descriptive sentences via nucleus sampling; the most frequent words across these sentences are then collected into interpretable word clouds. Through faithfulness tests and human analyses, the approach reveals spurious correlations and dataset biases (e.g., Background Challenge cues, Waterbirds shifts) and demonstrates a practical mitigation workflow that uses GradCAM masking to improve worst-group accuracy while preserving overall performance. The work also validates latent-space alignment with Stable Diffusion and BLIP, supporting the claim that the extracted textual explanations reflect meaningful visual features. Overall, TExplain offers an adaptable framework for diagnosing and mitigating biases in image classifiers, with potential extensions to segmentation, autoencoders, and 3D data modalities.
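The nucleus (top-p) sampling step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `nucleus_filter` and `sample_token`, the threshold `top_p=0.9`, and the toy next-token distribution are assumptions made for illustration. The idea is to keep only the smallest set of highest-probability tokens whose cumulative mass reaches `top_p`, renormalize, and sample from that set.

```python
import random


def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of top-ranked tokens whose cumulative
    probability reaches top_p, then renormalize to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        total += p
        if total >= top_p:
            break
    return {token: p / total for token, p in kept}


def sample_token(probs, top_p=0.9, rng=random):
    """Draw one token from the renormalized nucleus."""
    nucleus = nucleus_filter(probs, top_p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution, for illustration only.
toy_probs = {"bird": 0.5, "water": 0.3, "tree": 0.15, "sky": 0.05}
nucleus = nucleus_filter(toy_probs, top_p=0.9)
```

With this toy distribution, the low-probability token `sky` falls outside the nucleus and can never be sampled, which is how top-p sampling keeps generated sentences varied without drifting into unlikely words.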
Abstract
Interpreting the learned features of vision models is a longstanding challenge in machine learning. To address it, we propose a novel method that leverages language models to interpret the learned features of pre-trained image classifiers. Our method, called TExplain, trains a neural network to connect the feature space of image classifiers with that of language models. At inference, it generates a large number of sentences to explain the features the classifier has learned for a given image. The most frequent words in these sentences are then extracted, summarizing the features and patterns the classifier has learned. Our method, for the first time, uses these frequent words corresponding to a visual representation to provide insight into the decision-making process of an independently trained classifier, enabling the detection of spurious correlations and biases and a deeper understanding of its behavior. To validate the approach, we conduct experiments on diverse datasets, including ImageNet-9L and Waterbirds. The results demonstrate the potential of our method to enhance the interpretability and robustness of image classifiers.
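The frequent-word extraction described in the abstract can be sketched as a simple counting step over the generated sentences. This is a minimal illustration under stated assumptions, not the paper's code: the function name `dominant_words`, the small stopword list, the tokenization rule, and the example sentences are all hypothetical.

```python
import re
from collections import Counter

# Assumed stopword list; a real pipeline would likely use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "over", "with",
             "and", "is", "are", "at", "to"}


def dominant_words(sentences, top_k=5, stopwords=STOPWORDS):
    """Count non-stopword tokens across all sentences and return
    the top_k (word, count) pairs, most frequent first."""
    counts = Counter()
    for sentence in sentences:
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word not in stopwords:
                counts[word] += 1
    return counts.most_common(top_k)


# Hypothetical generated sentences for a single image.
generated = [
    "a bird standing in the water",
    "a bird flying over water",
    "a white bird on a lake",
]
top_words = dominant_words(generated, top_k=2)
```

The resulting counts are what a word cloud would visualize: words that recur across many sampled sentences are taken as evidence of what the classifier's features encode, so a background word outranking the object word would hint at a spurious correlation.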
