Case-Enhanced Vision Transformer: Improving Explanations of Image Similarity with a ViT-based Similarity Metric
Ziwei Zhao, David Leake, Xiaomeng Ye, David Crandall
TL;DR
CEViT introduces a ViT-based image similarity metric framed as a case-based reasoning approach. It forms a paired input by concatenating the query with a reference image along the channel axis ($h\times w\times 2c$) and uses an MLP head to output a similarity score in $[0,1]$, enabling attention-based explanations. When integrated with a $k$-NN pipeline, CEViT achieves accuracy comparable to a ViT on MNIST ($99.0\%$ vs $99.1\%$) while preserving nearest-neighbor explanations and producing attention masks that highlight inter-class differences. The work points to future directions on harder datasets and in counterfactual/semi-factual explanations to further enhance explainability and practical impact.
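The pairing and $k$-NN integration summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`make_pair`, `placeholder_similarity`, `knn_predict`) are hypothetical, and the trained ViT with its MLP head is replaced by a simple pixel-difference score, assuming pixel values in $[0,1]$, so that the example runs on its own.

```python
import numpy as np
from collections import Counter

def make_pair(query, reference):
    """Concatenate two h x w x c images along the channel axis -> h x w x 2c."""
    if query.shape != reference.shape:
        raise ValueError("query and reference must have the same shape")
    return np.concatenate([query, reference], axis=-1)

def placeholder_similarity(pair):
    """Hypothetical stand-in for the trained ViT + MLP head.

    Maps a paired input to a score in [0, 1], assuming pixel values in [0, 1].
    """
    c = pair.shape[-1] // 2
    query, reference = pair[..., :c], pair[..., c:]
    return float(1.0 - np.mean(np.abs(query - reference)))

def knn_predict(query, cases, labels, k=3, similarity=placeholder_similarity):
    """Classify a query by majority vote over its k most similar stored cases."""
    scores = np.array([similarity(make_pair(query, case)) for case in cases])
    top = np.argsort(-scores)[:k]  # indices of the k highest similarity scores
    votes = Counter(labels[i] for i in top)
    return votes.most_common(1)[0][0], top

# Toy usage with MNIST-sized grayscale images (c = 1)
query = np.full((28, 28, 1), 0.1)
cases = [np.zeros((28, 28, 1)), np.full((28, 28, 1), 0.2), np.ones((28, 28, 1))]
labels = [0, 0, 1]
predicted, neighbors = knn_predict(query, cases, labels, k=3)
```

The returned neighbor indices identify the retrieved cases, which is what keeps the nearest-neighbor explanation available alongside the predicted label.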
Abstract
This short paper presents preliminary research on the Case-Enhanced Vision Transformer (CEViT), a similarity measurement method aimed at improving the explainability of similarity assessments for image data. Initial experimental results suggest that integrating CEViT into k-Nearest Neighbor (k-NN) classification yields classification accuracy comparable to state-of-the-art computer vision models, while adding capabilities for illustrating differences between classes. CEViT explanations can be influenced by prior cases, to illustrate aspects of similarity relevant to those cases.
