Case-Enhanced Vision Transformer: Improving Explanations of Image Similarity with a ViT-based Similarity Metric

Ziwei Zhao, David Leake, Xiaomeng Ye, David Crandall

TL;DR

CEViT introduces a ViT-based image similarity metric framed as a case-based reasoning approach. It builds a joint input by concatenating the query image with a reference image along the channel dimension ($h\times w\times 2c$, i.e. two channels for grayscale images) and uses an MLP head to output a similarity score in $[0,1]$, which enables attention-based explanations. When integrated into a $k$-NN pipeline, CEViT achieves accuracy comparable to a ViT on MNIST ($99.0\%$ vs $99.1\%$) while preserving nearest-neighbor explanations and producing attention masks that highlight inter-class differences. The work points to future directions on harder datasets and on counterfactual/semi-factual explanations to further improve explainability and practical impact.
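To make the pipeline above concrete, here is a minimal sketch in plain Python of channel-wise pairing followed by $k$-NN voting over similarity scores. It is not the authors' implementation: the names `pair_input`, `toy_similarity`, and `knn_predict` are hypothetical, images are assumed to be nested lists of shape $h\times w\times c$ with values in $[0,1]$, and `toy_similarity` is only a hand-written stand-in for the learned ViT with its MLP head.

```python
from collections import Counter


def pair_input(query, reference):
    """Concatenate two h x w x c images along the channel axis (h x w x 2c)."""
    return [[q_px + r_px for q_px, r_px in zip(q_row, r_row)]
            for q_row, r_row in zip(query, reference)]


def toy_similarity(pair):
    """Placeholder for the learned model (assumption, not CEViT itself).

    Returns 1 - mean absolute difference between the two halves of each
    pixel's channel vector. With pixel values in [0, 1] the score is in [0, 1].
    """
    total, count = 0.0, 0
    for row in pair:
        for px in row:
            c = len(px) // 2
            for a, b in zip(px[:c], px[c:]):
                total += abs(a - b)
                count += 1
    return 1.0 - total / count


def knn_predict(query, cases, k=3, similarity=toy_similarity):
    """Majority vote over the k stored cases most similar to the query.

    `cases` is a list of (image, label) pairs. The top-k neighbors are
    returned as well, since they serve as the nearest-neighbor explanation.
    """
    scored = [(similarity(pair_input(query, img)), label) for img, label in cases]
    scored.sort(key=lambda t: t[0], reverse=True)
    top = scored[:k]
    votes = Counter(label for _, label in top)
    return votes.most_common(1)[0][0], top
```

Swapping `toy_similarity` for a trained model that accepts the paired tensor would leave `knn_predict` unchanged, which is the sense in which the similarity metric plugs into a standard $k$-NN classifier.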

Abstract

This short paper presents preliminary research on the Case-Enhanced Vision Transformer (CEViT), a similarity measurement method aimed at improving the explainability of similarity assessments for image data. Initial experimental results suggest that integrating CEViT into k-Nearest Neighbor (k-NN) classification yields classification accuracy comparable to state-of-the-art computer vision models, while adding capabilities for illustrating differences between classes. CEViT explanations can be influenced by prior cases, to illustrate aspects of similarity relevant to those cases.

Paper Structure (15 sections, 2 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Explaining class differences.
  • Figure 2: The patchify process
  • Figure 3: Given an input image (or image-like tensor), the patchify process divides it into $N^2$ smaller image patches (in this example, $N=4$). These image tokens, along with a classification token, are then fed into the transformer model to produce the output.
  • Figure 4: Accessing the attention mask between the classification token and $M$ image tokens in a transformer model with $L$ encoders.
  • Figure 5: The quantitative evaluation process. Following the patchify process, image patches from both the query image and the distractor image are merged based on the normalized masks. These hybrid tokens are used to compute updated class likelihoods.
  • ...and 2 more figures
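
The patchify step (Figure 3) and the mask-based merging used for evaluation (Figure 5) can be illustrated with a small plain-Python sketch. The function names are hypothetical, and images are assumed to be single-channel nested lists. The captions do not state the exact merge rule, so the per-patch convex blend in `blend_patches` is an assumption made purely for illustration.

```python
def patchify(image, n):
    """Split an h x w image into n * n patches, returned in row-major order."""
    h, w = len(image), len(image[0])
    if h % n or w % n:
        raise ValueError("image sides must be divisible by n")
    ph, pw = h // n, w // n
    return [
        [row[j * pw:(j + 1) * pw] for row in image[i * ph:(i + 1) * ph]]
        for i in range(n)
        for j in range(n)
    ]


def blend_patches(query_patches, distractor_patches, mask):
    """Merge corresponding patches using one weight per patch.

    Assumed rule (not confirmed by the source): each hybrid patch is
    mask[i] * query + (1 - mask[i]) * distractor, with mask[i] in [0, 1].
    """
    merged = []
    for q, d, m in zip(query_patches, distractor_patches, mask):
        merged.append([[m * qv + (1 - m) * dv for qv, dv in zip(q_row, d_row)]
                       for q_row, d_row in zip(q, d)])
    return merged
```

With `n=4`, as in the Figure 3 example, an 8 x 8 image would yield 16 patches of size 2 x 2. In a transformer these patches would then be flattened and embedded as tokens alongside the classification token.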