
Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods

Mingqi Jiang, Saeed Khorram, Li Fuxin

TL;DR

This work tackles the interpretability gap in comparing Transformers and CNNs by introducing dataset-wide explanation statistics. It presents two methods, sub-explanation counting and cross-testing, to quantify compositionality versus disjunctivism and to map the feature-use landscape across architectures. Key findings include a strong influence of normalization on compositionality (batch normalization reduces it, while layer and group normalization increase it), with Transformers and ConvNeXt tending toward more compositional behavior than traditional CNNs, and distillation shaping transformer explanations toward CNN-like patterns. These insights advance understanding of deep visual models and suggest ensemble opportunities leveraging diverse feature-use strategies.

Abstract

In order to gain insights about the decision-making of different visual recognition backbones, we propose two methodologies, sub-explanation counting and cross-testing, that systematically apply deep explanation algorithms on a dataset-wide basis and compare the statistics generated from the amount and nature of the explanations. These methodologies reveal the differences among networks in terms of two properties called compositionality and disjunctivism. Transformers and ConvNeXt are found to be more compositional, in the sense that they jointly consider multiple parts of the image in building their decisions, whereas traditional CNNs and distilled transformers are less compositional and more disjunctive, which means that they use multiple diverse but smaller sets of parts to achieve a confident prediction. Through further experiments, we pinpointed the choice of normalization to be especially important in the compositionality of a model, in that batch normalization leads to less compositionality while group and layer normalization lead to more. Finally, we also analyze the features shared by different backbones and plot a landscape of different models based on their feature-use similarity.
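
The contrast between compositional and disjunctive behavior can be illustrated with a toy sketch. It is not the authors' procedure: the two "models" below are hand-written scoring functions over four hypothetical image patches, the exhaustive search stands in for whatever search the paper uses, and the 0.9 sufficiency threshold is taken from the Figure 2 caption. The `min_ratio` cutoff for counting sub-explanations is an assumption made only for this example.

```python
from itertools import combinations

ALL_PATCHES = frozenset(range(4))  # toy image split into 4 patches (assumption)


def compositional_prob(visible):
    """Toy 'compositional' model: confidence grows with every visible patch."""
    return 0.1 + 0.2 * len(visible)


def disjunctive_prob(visible):
    """Toy 'disjunctive' model: patches {0, 1} OR patches {2, 3} suffice."""
    if {0, 1} <= visible or {2, 3} <= visible:
        return 0.9
    return 0.1


def likelihood_ratio(prob, visible):
    """Target-class probability on the masked image relative to the full image."""
    return prob(visible) / prob(ALL_PATCHES)


def minimal_sufficient_explanations(prob, threshold=0.9):
    """Smallest-first exhaustive search for patch sets that reach the
    threshold and contain no smaller sufficient set."""
    found = []
    for size in range(1, len(ALL_PATCHES) + 1):
        for combo in combinations(sorted(ALL_PATCHES), size):
            subset = frozenset(combo)
            if any(mse <= subset for mse in found):
                continue  # a smaller explanation already covers this set
            if likelihood_ratio(prob, subset) >= threshold:
                found.append(subset)
    return found


def count_sub_explanations(prob, mse, min_ratio=0.5):
    """Count non-empty proper subsets of an MSE that keep the likelihood
    ratio at or above min_ratio (a cutoff chosen for this example)."""
    count = 0
    for size in range(1, len(mse)):
        for combo in combinations(sorted(mse), size):
            if likelihood_ratio(prob, frozenset(combo)) >= min_ratio:
                count += 1
    return count


for name, prob in [("compositional", compositional_prob),
                   ("disjunctive", disjunctive_prob)]:
    mses = minimal_sufficient_explanations(prob)
    counts = [count_sub_explanations(prob, m) for m in mses]
    print(name, [sorted(m) for m in mses], counts)
```

In this toy setting the compositional function yields a single large explanation with many moderately confident sub-explanations, while the disjunctive function yields two small, alternative explanations whose sub-explanations lose almost all confidence.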

Paper Structure (21 sections, 2 equations, 37 figures, 12 tables)


Figures (37)

  • Figure 1: Different behaviors exhibited by different classes of models. Likelihood ratio refers to the ratio between the predicted class-conditional probability of the target category from the masked image and that from the full image. With the compositional behavior, a confident classification is built up jointly from multiple parts, so removing some parts may only slightly reduce the likelihood ratio (shown below each node in the tree in the top-right part of the figure). With the disjunctive behavior, the network requires very few parts to obtain a highly confident prediction (sometimes more confident than the full image), but it can rely on any of multiple diverse combinations to obtain a confident prediction, similar to a logical OR among the different conjunctions. (Best viewed in color.)
  • Figure 2: Illustration of Minimal Sufficient Explanations (MSEs) and sub-explanations. MSEs are minimally masked images that the deep network would recognize as the same category as the full image, with a predicted class-conditional probability of at least $90\%$ of that from the full image. A sub-explanation is defined as a subset of the patches of an MSE. (Best viewed in color.)
  • Figure 3: Cross-testing the Insertion metric between VGG-19 and Swin-T for "hummingbird". (Top) Insertion images are obtained by successively revealing pixels that are deemed salient by the heatmap. (Bottom) The area under each curve (AUC) is used to compute the insertion metric for each classifier when the heatmaps are generated from only one of them. (Best viewed in color.)
  • Figure 4: MSEs and some sub-explanations of different models on an image of the Indigo Bird class. Due to space limits, only a few sub-explanations are shown. The patch removed from the parent node is indicated with a red outline. (Best viewed in color.)
  • Figure 5: Sorted average values of the maximal activation in each image for each channel in the last block for ConvNeXt-T variants
  • ...and 32 more figures
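
The cross-testing idea in the Figure 3 caption can be sketched as follows, under simplifying assumptions: the image is reduced to four patches rather than pixels, `model_a` and `model_b` are hypothetical scoring functions standing in for real classifiers, and the heatmaps are hand-written rather than produced by an explanation algorithm. The sketch only shows how an insertion curve is built from a saliency ranking and how its AUC changes when the heatmap comes from a different model.

```python
def insertion_auc(prob, saliency):
    """Area under the insertion curve: reveal patches from most to least
    salient, record the model's confidence after each step, and integrate
    with the trapezoidal rule over the fraction of patches revealed."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    visible = set()
    curve = [prob(frozenset(visible))]  # confidence on the fully masked image
    for idx in order:
        visible.add(idx)
        curve.append(prob(frozenset(visible)))
    steps = len(order)
    return sum((curve[k] + curve[k + 1]) / 2 for k in range(steps)) / steps


# Hypothetical stand-ins for two classifiers that rely on different patches.
def model_a(visible):
    return 0.5 * len(visible & {0, 1})


def model_b(visible):
    return 0.5 * len(visible & {2, 3})


# Hand-written heatmaps: each ranks highest the patches its own model uses.
heatmap_a = [1.0, 0.75, 0.25, 0.0]
heatmap_b = [0.0, 0.25, 0.75, 1.0]

models = {"A": model_a, "B": model_b}
heatmaps = {"A": heatmap_a, "B": heatmap_b}

# cross_test[(evaluated model, heatmap source)] = insertion AUC
cross_test = {
    (m, h): insertion_auc(models[m], heatmaps[h])
    for m in models
    for h in heatmaps
}
for key, value in sorted(cross_test.items()):
    print(key, value)
```

When a model scores well on insertion images built from another model's heatmap, the two plausibly rely on similar features; a large drop, as in this toy case, suggests they rely on different ones.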