Inpainting the Gaps: A Novel Framework for Evaluating Explanation Methods in Vision Transformers

Lokesh Badisa, Sumohana S. Channappayya

TL;DR

Pixel-masking perturbation tests for Vision Transformer (ViT) explanation methods introduce a test-time distribution shift that biases evaluation. InG (Inpainting the Gaps) instead removes object parts from real images by inpainting, which reduces the distribution shift and enables meaningful, part-level evaluation of ViT explanation methods on PartImageNet. The framework is model-agnostic and requires no retraining. Using MI-GAN for inpainting and OTDD for distribution comparison, it yields higher and more consistent evaluation scores across ViT variants, with Generic Attribution (GA) and Beyond Intuition (BI) emerging as the most consistent explanation methods. Overall, InG provides a practical, semi-synthetic approach to evaluating ViT explanations that better reflects real-world conditions.

Abstract

The perturbation test remains the go-to evaluation approach for explanation methods in computer vision. This evaluation method has a major drawback of test-time distribution shift due to pixel-masking that is not present in the training set. To overcome this drawback, we propose a novel evaluation framework called Inpainting the Gaps (InG). Specifically, we propose inpainting parts that constitute partial or complete objects in an image. In this way, one can perform meaningful image perturbations with lower test-time distribution shifts, thereby improving the efficacy of the perturbation test. InG is applied to the PartImageNet dataset to evaluate the performance of popular explanation methods for three training strategies of the Vision Transformer (ViT). Based on this evaluation, we found Beyond Intuition and Generic Attribution to be the two most consistent explanation models. Further, and interestingly, the proposed framework results in higher and more consistent evaluation scores across all the ViT models considered in this work. To the best of our knowledge, InG is the first semi-synthetic framework for the evaluation of ViT explanation methods.
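
The part-level perturbation test described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`perturbation_curve`, `zero_fill`, `mean_fill`), the ranking of parts by mean attribution, and the cumulative removal order are all assumptions made for illustration. In particular, `mean_fill` is only a toy stand-in for a learned inpainter such as MI-GAN, and `predict` is a hypothetical callable that returns the model's score for the target class.

```python
import numpy as np

def part_scores(attribution, part_masks):
    """Mean attribution inside each boolean part mask (H x W)."""
    return [float(attribution[m].mean()) for m in part_masks]

def zero_fill(image, mask):
    """Classic pixel masking: set removed pixels to zero."""
    out = image.copy()
    out[mask] = 0.0
    return out

def mean_fill(image, mask):
    """Toy stand-in for an inpainter (assumption, not MI-GAN):
    fill removed pixels with the per-channel mean of the kept pixels."""
    out = image.copy()
    out[mask] = image[~mask].mean(axis=0)
    return out

def perturbation_curve(image, attribution, part_masks, predict, fill,
                       most_relevant_first=True):
    """Remove parts one at a time, ranked by attribution, and record
    the model score after each cumulative removal."""
    order = np.argsort(part_scores(attribution, part_masks))
    if most_relevant_first:
        order = order[::-1]
    removed = np.zeros(attribution.shape, dtype=bool)
    curve = [predict(image)]  # score on the unperturbed image
    for idx in order:
        removed |= part_masks[idx]
        curve.append(predict(fill(image, removed)))
    return curve
```

Swapping `zero_fill` for an inpainting-based `fill` is the only change needed to move from a masking-based test to an inpainting-based one in this sketch, which is why the fill strategy is passed in as a parameter.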

Paper Structure (22 sections, 3 equations, 10 figures, 3 tables)

Figures (10)

  • Figure 1: Pipeline of our method. Although the figure shows only single-part perturbation, our framework uses multiple-part perturbation, which is omitted here in the interest of space.
  • Figure 2: A qualitative example of inpainting models. The top-left image is inpainted in the regions identified by the masks in the left column. The label of each inpainted image identifies the inpainting model.
  • Figure 3: An illustrative example of image generation using the MI-GAN model (Sargsyan et al., ICCV 2023). The top row shows the masks, and the bottom row shows inpainted images. This example shows all possible part removals. Masked regions have blended with the background, resulting in realistic part removal.
  • Figure 4: A qualitative comparison of masking and inpainting-based evaluation of the Beyond Intuition-Head (BI) method applied to ViT-Base. The first and third columns are the masked and inpainted images, respectively. The second and fourth columns are the attention maps generated by the BI model. Explanation maps for the masked images show undesired feeble attribution to the masked regions. This is clear from the last row, where the attribution of the masked image includes the masked regions. However, this attribution is much lower in the inpainted images.
  • Figure 5: A qualitative example of how explanation methods work on masked images. Explanations are generated with Beyond Intuition-Head. The first two columns are the mask and the masked image, respectively. The remaining columns are the explanation maps generated by the models named in each column's title.
  • ...and 5 more figures