Keypoint Counting Classifiers: Turning Vision Transformers into Self-Explainable Models Without Training

Kristoffer Wickstrøm, Teresa Dorszewski, Siyan Chen, Michael Kampffmeyer, Elisabeth Wetzer, Robert Jenssen

TL;DR

This work tackles the explainability gap in Vision Transformer (ViT) foundation models by introducing Keypoint Counting Classifiers (KCCs), a training-free self-explainable paradigm that turns pretrained ViTs into SEMs. KCCs identify image keypoints, match them to prototype keypoints via mutual nearest neighbors, and classify by counting matches, with explanations visualized as interpretable keypoints rather than bounding boxes or heatmaps. The authors validate KCCs through quantitative benchmarks and a comprehensive user study, showing improved explanation quality and user understanding, and demonstrate that vision-language labeling (VLParts) can automatically describe keypoints to reduce reader bias. The work highlights the flexibility and practicality of training-free SEMs for ViTs and points to future improvements in keypoint weighting and bias reduction. Overall, KCCs represent a meaningful step toward transparent, reliable ViT-based models suitable for safety-critical applications.

Abstract

Current approaches for designing self-explainable models (SEMs) require complicated training procedures and specific architectures, which makes them impractical. With the advance of general-purpose foundation models based on Vision Transformers (ViTs), this impracticability becomes even more problematic. Therefore, new methods are necessary to provide transparency and reliability to ViT-based foundation models. In this work, we present a new method for turning any well-trained ViT-based model into a SEM without retraining, which we call Keypoint Counting Classifiers (KCCs). Recent works have shown that ViTs can automatically identify matching keypoints between images with high precision, and we build on these results to create an easily interpretable decision process that is inherently visualizable in the input. We perform an extensive evaluation, which shows that KCCs improve human-machine communication compared to recent baselines. We believe that KCCs constitute an important step towards making ViT-based foundation models more transparent and reliable.
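
The decision process described above (match query keypoints to prototype keypoints via mutual nearest neighbors, then classify by counting matches) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity, the toy 2-D feature vectors, and the choice to aggregate match counts per class label are all assumptions made for illustration, and the keypoint identification step is omitted entirely.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mutual_nearest_neighbors(query, pool):
    """Return (i, j) pairs where query[i] and pool[j] are each other's nearest neighbor."""
    if not query or not pool:
        return []
    sim = [[cosine(q, p) for p in pool] for q in query]
    # Nearest pool keypoint for every query keypoint.
    nn_of_query = [max(range(len(pool)), key=row.__getitem__) for row in sim]
    # Nearest query keypoint for every pool keypoint.
    nn_of_pool = [
        max(range(len(query)), key=lambda i: sim[i][j]) for j in range(len(pool))
    ]
    # Keep a pair only if the nearest-neighbor relation holds in both directions.
    return [(i, j) for i, j in enumerate(nn_of_query) if nn_of_pool[j] == i]


def count_matches(query, prototypes):
    """Count mutual-NN matches per class; prototypes is a list of (label, feature)."""
    pool = [feature for _, feature in prototypes]
    counts = {label: 0 for label, _ in prototypes}
    for _, j in mutual_nearest_neighbors(query, pool):
        counts[prototypes[j][0]] += 1
    return counts


def predict(query, prototypes):
    """Predict the class with the most matched keypoints (assumed aggregation rule)."""
    counts = count_matches(query, prototypes)
    return max(counts, key=counts.get)


# Toy data: three query keypoints and three prototype keypoints from two classes.
query = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
prototypes = [("a", [1.0, 0.05]), ("a", [0.1, 1.0]), ("b", [-1.0, 0.0])]

print(count_matches(query, prototypes))  # {'a': 2, 'b': 0}
print(predict(query, prototypes))        # a
```

In the toy data, the third query keypoint is closest to the first prototype keypoint, but that prototype keypoint is closer to the first query keypoint, so the pair is not mutual and is discarded. The explanation for the prediction is then simply the set of surviving matched pairs.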

Paper Structure

This paper contains 23 sections, 4 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: A demonstration of KCCs in the context of bird classification. KCCs identify matching keypoints between a query (leftmost image) and a set of prototypes (rightmost images). Only prototypes with matches are shown to avoid overloading the reader. Predictions are made by counting the number of matches. Here, we leverage ViTs with vision-language capabilities to automatically describe the keypoints [vlpart]. Note that the class names in the predictions are deliberately omitted to avoid readers using the class names instead of the explanation [Kim2022HIVE].
  • Figure 2: Illustration of each step of KCCs. Going from column 4 to column 5, mutual NNs are computed between the keypoints in the query and all prototype keypoints. Only keypoints that are mutual NNs are kept. In this case, prototype 3 has no mutual NNs with the query and is therefore shown without keypoints.
  • Figure 3: Qualitative examples of KCC explanations.
  • Figure 4: Results from the user study on user confidence as a function of agreement with the explanation. The results show that KCCs allow people to be more confident in correcting the model.
  • Figure 5: Automated keypoint description using a vision-language model.
  • ...and 1 more figure