Case-Enhanced Vision Transformer: Improving Explanations of Image Similarity with a ViT-based Similarity Metric

Ziwei Zhao; David Leake; Xiaomeng Ye; David Crandall

TL;DR

CEViT introduces a ViT-based image similarity metric framed as a case-based reasoning approach. It concatenates the query image with a reference image along the channel axis to form a single $h\times w\times 2c$ input (two channels for grayscale images, where $c=1$), and uses an MLP head to output a similarity score in $[0,1]$, which enables attention-based explanations. When integrated into a $k$-NN pipeline, CEViT achieves accuracy comparable to a ViT on MNIST ($99.0\%$ vs $99.1\%$) while preserving nearest-neighbor explanations and producing attention masks that highlight inter-class differences. The work identifies harder datasets and counterfactual/semi-factual explanations as future directions for improving explainability and practical impact.
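As a rough illustration of the pipeline described above, the sketch below builds the $h\times w\times 2c$ pair input and runs a similarity-based $k$-NN vote. It is a minimal sketch under stated assumptions, not the authors' implementation: the names `make_pair_input`, `placeholder_similarity`, and `knn_classify` are hypothetical, and the scoring function is a simple pixel-difference stand-in for the trained ViT with its MLP head.

```python
import numpy as np

def make_pair_input(query, reference):
    """Concatenate two h x w x c images along the channel axis -> h x w x 2c."""
    if query.shape != reference.shape:
        raise ValueError("query and reference must have the same shape")
    return np.concatenate([query, reference], axis=-1)

def placeholder_similarity(pair):
    """Stand-in for the trained model (ASSUMPTION, not CEViT).

    Splits the pair back into its two halves and returns
    1 - mean absolute difference, which lies in [0, 1]
    when pixel values lie in [0, 1].
    """
    c = pair.shape[-1] // 2
    q, r = pair[..., :c], pair[..., c:]
    return float(1.0 - np.mean(np.abs(q - r)))

def knn_classify(query, cases, labels, k=3, similarity_fn=placeholder_similarity):
    """Score the query against every stored case and vote among the top k."""
    scores = np.array([similarity_fn(make_pair_input(query, c)) for c in cases])
    # Indices of the k most similar cases, most similar first.
    top = np.argsort(-scores, kind="stable")[:k]
    votes = {}
    for i in top:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    # Majority vote; a tie goes to the label seen first among the neighbors.
    predicted = max(votes, key=votes.get)
    return predicted, top, scores[top]

# Toy data: two "dark" cases (label 0) and two "bright" cases (label 1).
cases = [np.zeros((28, 28, 1)), np.full((28, 28, 1), 0.1),
         np.ones((28, 28, 1)), np.full((28, 28, 1), 0.9)]
labels = [0, 0, 1, 1]
query = np.full((28, 28, 1), 0.05)
predicted, neighbors, neighbor_scores = knn_classify(query, cases, labels, k=3)
```

Because the retrieved neighbors are returned alongside the prediction, they remain available as case-based explanations, which is the property the $k$-NN integration is meant to preserve.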

Abstract

This short paper presents preliminary research on the Case-Enhanced Vision Transformer (CEViT), a similarity measurement method aimed at improving the explainability of similarity assessments for image data. Initial experimental results suggest that integrating CEViT into k-Nearest Neighbor (k-NN) classification yields classification accuracy comparable to state-of-the-art computer vision models, while adding capabilities for illustrating differences between classes. CEViT explanations can be influenced by prior cases, to illustrate aspects of similarity relevant to those cases.

Paper Structure (15 sections, 2 equations, 7 figures, 2 tables)

Introduction
Related Work
Explainable AI for Computer Vision
Applying CBR to Image Data
Similarity Metrics in CBR
The CEViT Method
Model design
Attention Mask
Using CEViT for classification
Experiments
Implementation Details
Quantitative Evaluation
Qualitative Evaluation
Conclusion and Future Work
Acknowledgements

Figures (7)

Figure 1: Explaining class differences.
Figure 2: The patchify process
Figure 3: Given an input image (or image-like tensor), the patchify process divides it into $N^2$ smaller image patches (in this example, $N=4$). These image tokens, along with a classification token, are then fed into the transformer model to produce the output.
Figure 4: Accessing the attention mask between the classification token and M image tokens in a transformer model with L encoders.
Figure 5: The quantitative evaluation process. Following the patchify process, image patches from both the query image and the distractor image are merged based on the normalized masks. These hybrid tokens are used to compute updated class likelihoods.
...and 2 more figures
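
The patchify step named in the figure captions can be sketched as follows. This is a minimal sketch, assuming non-overlapping square-grid patches and an input whose height and width are divisible by $N$; the function name `patchify` and the flattening order are illustrative choices, and the linear projection and classification token that a transformer would add afterwards are omitted.

```python
import numpy as np

def patchify(image, n):
    """Split an h x w x c array into n*n non-overlapping patches.

    Returns an array of shape (n*n, ph*pw*c), one flattened patch per row,
    ordered row by row across the patch grid.
    """
    h, w, c = image.shape
    if h % n or w % n:
        raise ValueError("height and width must be divisible by n")
    ph, pw = h // n, w // n
    # Axes after reshape: (grid row, row in patch, grid col, col in patch, channel)
    patches = image.reshape(n, ph, n, pw, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(n * n, ph * pw * c)

# Example: a 28 x 28 two-channel pair tensor split with N = 4.
image = np.arange(28 * 28 * 2).reshape(28, 28, 2)
tokens = patchify(image, 4)
```

With $N=4$, a $28\times 28\times 2$ input yields 16 patches of $7\times 7\times 2 = 98$ values each.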
