Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer

Junyi Wu, Bin Duan, Weitai Kang, Hao Tang, Yan Yan

TL;DR

Vision Transformer explanations often neglect the influence of token transformations, risking misleading rationales. TokenTM introduces a token transformation measurement based on token length changes and directional correlation (NECC) and integrates it with attention via a layer-wise aggregation framework to produce a faithful contribution map. The method yields object-centric explanations and outperforms baselines on segmentation localization and perturbation robustness, with ablations validating the components. This approach advances post-hoc interpretability for Vision Transformers by capturing the cumulative effects of token transformations across layers.

Abstract

While Transformers have rapidly gained popularity in various computer vision applications, post-hoc explanations of their internal mechanisms remain largely unexplored. Vision Transformers extract visual information by representing image regions as transformed tokens and integrating them via attention weights. However, existing post-hoc explanation methods consider only these attention weights and neglect crucial information from the transformed tokens, so they fail to accurately illustrate the rationales behind the models' predictions. To incorporate the influence of token transformation into interpretation, we propose TokenTM, a novel post-hoc explanation method that uses a measurement of token transformation effects that we introduce. Specifically, we quantify token transformation effects by measuring changes in token lengths and correlations in their directions pre- and post-transformation. Moreover, we develop initialization and aggregation rules to integrate both attention weights and token transformation effects across all layers, capturing holistic token contributions throughout the model. Experimental results on segmentation and perturbation tests demonstrate the superiority of our proposed TokenTM compared to state-of-the-art Vision Transformer explanation methods.
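
To make the two quantities named in the abstract concrete, the following is a minimal sketch of how a change in token length and a correlation in token direction could be computed for a single token. It assumes the L2 norm as the token "length" and plain cosine similarity as the directional correlation; the paper's actual measurement (including NECC) is not spelled out in this summary, so the function names and the exact formulas here are illustrative assumptions rather than the authors' definitions.

```python
import math


def length(v):
    """Euclidean (L2) length of a token vector (assumed notion of 'length')."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine of the angle between two token vectors; 0.0 if either is zero."""
    lu, lv = length(u), length(v)
    if lu == 0.0 or lv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (lu * lv)


def transformation_effect(before, after):
    """Return (length_ratio, direction_correlation) for one token.

    Hypothetical helper: an illustration of 'change in length' and
    'correlation in direction', not the paper's transformation weight.
    """
    lb = length(before)
    ratio = length(after) / lb if lb > 0.0 else 0.0
    return ratio, cosine(before, after)


# A token that is scaled but not rotated: length doubles, direction is kept.
scaled_ratio, scaled_corr = transformation_effect([3.0, 4.0], [6.0, 8.0])

# A token that is scaled and rotated by 90 degrees: direction is uncorrelated.
rotated_ratio, rotated_corr = transformation_effect([1.0, 0.0], [0.0, 2.0])
```

In this toy setting both tokens double in length, but only the first keeps its direction, which is the kind of distinction that attention weights alone do not capture.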

Paper Structure (18 sections, 20 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Visualization of attention and transformation weights, and the result of our TokenTM that integrates both of them. Circle sizes signify weight magnitudes or token lengths, and arrows indicate directions. Transformation weights are derived by our proposed measurement, which evaluates the transformation effects by gauging changes in length and direction. Both weights are visualized by heatmaps. Solely using attention weights often fails to localize foreground objects and inaccurately highlights noisy backgrounds as rationales. In contrast, leveraging additional information from transformation, our method produces object-centric post-hoc interpretations.
  • Figure 2: Illustration of our token transformation measurement. We depict original and transformed tokens with circles and arrows. Circle sizes reflect lengths, and arrows denote directions. The effects of token transformation are reflected by the changes in length and direction. Our method considers both properties to evaluate these effects, resulting in the corresponding transformation weights.
  • Figure 3: Illustration of our aggregation framework and the explanation pipeline. The overall contribution map is initialized by input token lengths and is updated using our $\mathbf{U}^l$ to trace token evolution across layers. In $\mathbf{C}^{l}$, the $i$-th row represents the influences of the input tokens $\mathbf{E}^0$ on the $i$-th token of the $l$-th layer's output $\mathbf{E}^l$. For $\mathbf{C}^{n_L}$, the row corresponding to the $[CLS]$ token is extracted and reshaped to produce the final explanation map.
  • Figure 4: Visualizations of post-hoc explanation heatmaps. Our method captures more object-centric regions as the rationales.
  • Figure 5: Visualizations of the localization process using our TokenTM. As more layers are incorporated into the aggregation, the heatmaps become increasingly focused, particularly in the regions surrounding the object recognized by the model.
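
The pipeline described in the Figure 3 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the contribution map starts as a diagonal matrix of input token lengths, that each layer's update is a left matrix multiplication by $\mathbf{U}^l$, and that the $[CLS]$ token sits at index 0. The caption does not say how $\mathbf{U}^l$ is built from attention and transformation weights, so the sketch takes the update matrices as given. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def init_contribution(tokens):
    """Assumed initialization: diagonal matrix of input token L2 lengths."""
    n = len(tokens)
    lengths = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    return [[lengths[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def aggregate(tokens, updates):
    """Apply each layer's update in order (assumed rule: C <- U @ C)."""
    c = init_contribution(tokens)
    for u in updates:
        c = matmul(u, c)
    return c


def cls_map(c, grid_h, grid_w, cls_index=0):
    """Take the [CLS] row, drop its own entry, and reshape to a patch grid."""
    row = [v for j, v in enumerate(c[cls_index]) if j != cls_index]
    if len(row) != grid_h * grid_w:
        raise ValueError("patch count does not match the requested grid")
    return [row[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy input: one [CLS] token followed by a 2x2 grid of patch tokens.
tokens = [[1.0, 0.0], [3.0, 4.0], [0.0, 2.0], [1.0, 0.0], [0.0, 0.0]]

# One hypothetical layer: [CLS] mixes all tokens equally, patches are unchanged.
update = [
    [0.2, 0.2, 0.2, 0.2, 0.2],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]

heatmap = cls_map(aggregate(tokens, [update]), grid_h=2, grid_w=2)
```

With these toy values the longest patch token receives the largest score in `heatmap`, reflecting that the initialization carries token lengths through to the final $[CLS]$ row.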