TextCAM: Explaining Class Activation Map with Text

Qiming Zhao, Xingjian Li, Xiaoyu Cao, Xiaolong Wu, Min Xu

TL;DR

TextCAM addresses the lack of semantic explanations in CAM-based visual explanations by fusing CAM with a CLIP-based semantic space to produce textual rationales. It derives per-channel semantic vectors using CLIP embeddings and Linear Discriminant Analysis, then combines them with CAM weights to generate an overall semantic representation $T_c(x)$; sparsity and decorrelation regularizers select diverse, concise phrases. The approach can also split saliency into concept-based groups via a greedy channel assignment, producing multiple text-annotated saliency maps. Experiments on ImageNet, CLEVR, CUB, and DomainNet show that TextCAM yields faithful, interpretable explanations, enables debiasing interventions, and transfers to Vision Transformers. The method is training-free and architecture-agnostic, offering a practical tool for diagnosing and building trust in vision models.
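The aggregation step summarized above can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes $T_c(x)$ is a CAM-weighted sum of per-channel semantic vectors, and it ranks candidate phrases by plain cosine similarity instead of the sparse, decorrelated selection the paper describes. The function names `aggregate_semantics` and `top_k_phrases`, the toy vectors, and the phrases are all hypothetical.

```python
import numpy as np

def aggregate_semantics(cam_weights, channel_vectors):
    """Assumed form of T_c(x): CAM-weighted sum of channel vectors, L2-normalised."""
    cam_weights = np.asarray(cam_weights, dtype=float)          # shape (K,)
    channel_vectors = np.asarray(channel_vectors, dtype=float)  # shape (K, D)
    t = cam_weights @ channel_vectors                           # shape (D,)
    norm = np.linalg.norm(t)
    return t / norm if norm > 0 else t

def top_k_phrases(t, text_embeddings, phrases, k=3):
    """Rank candidate phrases by cosine similarity to the aggregated vector.

    Simplification: plain top-k, not the sparse optimisation used in the paper.
    """
    text_embeddings = np.asarray(text_embeddings, dtype=float)
    unit = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    scores = unit @ t
    order = np.argsort(-scores)[:k]
    return [(phrases[i], float(scores[i])) for i in order]

# Toy data: three channels whose semantic vectors align with three phrases.
channel_vectors = np.eye(3)
cam_weights = np.array([0.7, 0.2, 0.1])   # stand-in for CAM channel weights
phrases = ["striped fur", "green grass", "blue sky"]
text_embeddings = np.eye(3)               # stand-in for CLIP text embeddings

t = aggregate_semantics(cam_weights, channel_vectors)
ranked = top_k_phrases(t, text_embeddings, phrases, k=2)
print(ranked)
```

In this toy setting the phrase tied to the most heavily weighted channel ranks first, which is the behaviour the CAM-weighted aggregation is meant to capture.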

Abstract

Deep neural networks (DNNs) have achieved remarkable success across domains but remain difficult to interpret, limiting their trustworthiness in high-stakes applications. This paper focuses on deep vision models, for which a dominant line of explainability methods is Class Activation Mapping (CAM) and its variants, which highlight the spatial regions that drive predictions. We observe that CAM provides little semantic insight into what attributes underlie these activations. To address this limitation, we propose TextCAM, a novel explanation framework that enriches CAM with natural language. TextCAM combines the precise spatial localization of CAM with the semantic alignment of vision-language models (VLMs). Specifically, we derive channel-level semantic representations using CLIP embeddings and linear discriminant analysis, and aggregate them with CAM weights to produce textual descriptions of salient visual evidence. This yields explanations that jointly specify where the model attends and what visual attributes likely support its decision. We further extend TextCAM to organize feature channels into semantically coherent groups, enabling more fine-grained visual-textual explanations. Experiments on ImageNet, CLEVR, and CUB demonstrate that TextCAM produces faithful and interpretable rationales that improve human understanding, detect spurious correlations, and preserve model fidelity.
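To make the channel-level step more concrete, the following is a minimal sketch of a two-class Fisher LDA direction computed from embeddings of samples that do and do not activate a channel. The abstract does not spell out how the authors set up the LDA, so the shrinkage term, the helper name `lda_direction`, and the random stand-ins for CLIP image embeddings are all assumptions made for illustration.

```python
import numpy as np

def lda_direction(pos, neg, shrinkage=1e-3):
    """Two-class Fisher LDA direction: solve S_w w = (mu_pos - mu_neg).

    pos, neg: arrays of shape (N, D) holding embeddings of samples that
    strongly / weakly activate a channel (an assumed setup).
    shrinkage: small ridge term added to keep S_w invertible (an assumption).
    """
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    # Pooled within-class scatter, centred per class.
    centered = np.vstack([pos - mu_pos, neg - mu_neg])
    s_w = centered.T @ centered / len(centered)
    s_w += shrinkage * np.eye(s_w.shape[0])
    w = np.linalg.solve(s_w, mu_pos - mu_neg)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

# Synthetic stand-ins for CLIP embeddings: positives are shifted along axis 0.
rng = np.random.default_rng(0)
dim = 8
offset = np.zeros(dim)
offset[0] = 2.0
pos = rng.normal(size=(200, dim)) + offset
neg = rng.normal(size=(200, dim))

direction = lda_direction(pos, neg)
print(np.round(direction, 2))
```

With this synthetic data the recovered direction points mostly along the axis that separates the two groups; in a setting like the one the abstract describes, such a direction would live in CLIP's joint space and could be compared against text embeddings.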

Paper Structure

This paper contains 29 sections, 10 equations, 10 figures.

Figures (10)

  • Figure 1: Overview of TextCAM. Left Bottom: Per-channel response pattern analysis with positive/negative samples. Per-channel representation is calculated by LDA in the image-text joint space of CLIP. Right Bottom: Calculating overall semantic representation by weights from CAM and selecting a diverse set of text explanations using sparse optimization. Right Top: Explaining saliency maps with top-K text explanations.
  • Figure 2: ImageNet results of TextCAM. Each column displays, respectively, the original image and the TextCAM results using Grad-CAM, Layer-CAM, Finer-CAM, and Eigen-CAM.
  • Figure 3: ImageNet results of TextCAM with saliency groups. Each column displays, respectively, the original image, Grad-CAM result, and grouped saliency maps along with their corresponding text explanations from the top-5 TextCAM results.
  • Figure 4: CLEVR qualitative results. Each row shows the input image, TextCAM for the shape head ($M_a$), and TextCAM for the color head ($M_b$). Top row: blue cube; bottom row: red ball. In both cases, heatmaps concentrate on the target object while remaining insensitive to the yellow cylinder distractor, indicating that the retrieved concepts (shape versus color) are supported by spatially localized evidence rather than incidental context.
  • Figure 5: TextCAM result statistics for ResNet50 models trained on the clipart, real, sketch, and quickdraw domains of the DomainNet dataset. Colored curves show the ratio of each attribute type appearing in the top-1 TextCAM results. Gray bars show the approximate rank of the embedding matrix formed by the top-1 TextCAM text embeddings from 1000 test examples.
  • ...and 5 more figures